# Data visualization with Python

The goal of this notebook is to use the [*Seaborn*](https://seaborn.pydata.org/) library to look at word counts in corpora and to visualize model parameters.

The dataset provided and required for this analysis, **brown_wordlist.txt**, contains the pre-sorted counts of words in the Brown Corpus.

## Part 1. Introducing Jupyter Notebooks

**CELLS - Markdown versus Code**

This is a markdown cell. It renders as HTML.

I can type in **bold** or *italics*.

- I can have bullet points

I can add LaTeX like this: $\sqrt{2+3^8}$

**Useful Shortcuts**

See more under Help > Keyboard Shortcuts.

- There are 2 modes for a cell:
    - Edit mode (green box) and
    - Command mode (blue box)
- Toggle between them with ESC and Return (Enter)
- Run a cell with Ctrl + Return (Enter)
- In command mode, add a cell above with A
- In command mode, add a cell below with B

## Part 2.  Data Visualization

Let's start by importing our libraries with the conventional aliases.

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

Now, let's use the pandas library to read our data into a data frame and look at it.

In [None]:
# Load the Brown word list (no header row; columns are the count and the word)
brown = pd.read_csv('brown_wordlist.txt', sep=' ', header=None, names=['Count','Word'])

# Add a rank column that is equal to the index plus 1.  Python indexing starts at 0.
brown['Rank'] = brown.index + 1

# See what the data frame looks like:
print(brown)
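
To see what this cell does without the real file, here is a minimal sketch on made-up data. It assumes each line of the word list holds a count and a word separated by a single space, sorted from most to least frequent. The three sample lines are invented for illustration and are not taken from brown_wordlist.txt.

```python
import io
import pandas as pd

# Hypothetical stand-in for the file: "count word" per line, already sorted
sample_text = "300 the\n120 of\n45 and\n"

sample = pd.read_csv(
    io.StringIO(sample_text),
    sep=' ',
    header=None,               # the data has no header row
    names=['Count', 'Word'],   # so we supply the column names ourselves
)

# The default index is 0, 1, 2, ... so adding 1 gives ranks 1, 2, 3, ...
sample['Rank'] = sample.index + 1

print(sample)
```

If a word list happens to contain tokens such as `null` or `NA`, pandas reads them as missing values by default. Passing `keep_default_na=False` to `pd.read_csv` keeps them as text.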

In [None]:
# See information about the dataframe
brown.info()

In [None]:
# View data in a particular index location
brown.iloc[0]

In [None]:
# See the columns
brown.columns

In [None]:
# See the index
brown.index

### Visualization
Use the seaborn library to create visualizations.  We will start with a box plot and then move on to point (scatter) plots.

Before plotting, we may want some basic statistics about the data.  For example, let's look at the Count data.

In [None]:
brown.Count.max()

In [None]:
brown.Count.min()

In [None]:
brown.Count.mean()

In [None]:
brown.Count.quantile([0.25,0.5,0.75])
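
The four calls above can also be collected in one step. As a sketch on invented counts (not the Brown data), `Series.describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum together:

```python
import pandas as pd

# Invented, heavily skewed counts standing in for brown.Count
counts = pd.Series([1000, 200, 50, 10, 5, 2, 1, 1, 1, 1], name='Count')

# One call returns count, mean, std, min, 25%, 50%, 75%, and max
summary = counts.describe()
print(summary)

# With skewed data like this, the mean sits far above the median
print(counts.mean() > counts.median())
```

Comparing the mean with the median is a quick way to spot skew before drawing any plot.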

In [None]:
# Create a visualization of the count data, alone.
# Get some help on the function first (IPython/Jupyter syntax)
?sns.boxplot
sns.boxplot(data=brown, x="Count")
# or
# sns.boxplot(x=brown["Count"])

What a skew!  We need to take a better look at this data.  Let's consider the count, rank, and the words together.

In [None]:
# Create a visualization: A point plot with linear axes
sns.relplot(
    data=brown,
    x="Rank", y="Count",
    edgecolor='none'
)


We can see this skew again on both axes.  What is a good way to look at data that spans several orders of magnitude like this?

In [None]:
# Use a log scale on the x (Rank) axis
logplot = sns.relplot(
    data=brown,
    x="Rank", y="Count",
    edgecolor='none'
)
logplot.set(xscale='log')


Getting better!  What if we put the Count axis on a log scale too?

In [None]:
# Use log scales on both axes (a log-log plot)
logplot = sns.relplot(
    data=brown,
    x="Rank", y="Count",
    edgecolor='none'
)
logplot.set(xscale='log', yscale='log')
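
One reason the log-log view is informative: if count falls off roughly as a power of rank (an assumption here, often discussed as Zipf's law for word frequencies), then log(count) is a straight line in log(rank). A minimal sketch on synthetic data checks this by fitting a line in log space. It uses NumPy, which is not imported elsewhere in this notebook.

```python
import numpy as np

# Synthetic ranks and idealized counts: count = 1000 / rank (assumed form)
rank = np.arange(1, 1001)
count = 1000.0 / rank

# Fit a straight line to log10(count) against log10(rank)
slope, intercept = np.polyfit(np.log10(rank), np.log10(count), deg=1)

print(round(slope, 3), round(intercept, 3))
```

On real word counts the points will scatter around a line rather than fall exactly on it, and the fitted slope need not match the idealized value used here.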


Wouldn't it be more useful if we knew what the words were?

In [None]:
# Log-log plot with a selection of words labeled
logplot = sns.relplot(
    data=brown,
    x="Rank", y="Count",
    edgecolor='none'
)
logplot.set(xscale='log', yscale='log')


# Define a function, label_point
def label_point(x, y, val, ax):
    a = pd.concat({'x': x, 'y': y, 'val': val}, axis=1)
    for i, point in a.iterrows():
        ax.text(point['x']+.02, point['y'], str(point['val']))
        
# Call the function label_point, with the Word data  
# Label a selection of words
index_values = [0,1,2,5,10,25,50,100,250,1000,2500,5000,10000,25000,40000]
label_point(brown.Rank[index_values], brown.Count[index_values], brown.Word[index_values], plt.gca())  
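
Note that `brown.Rank[index_values]` selects rows by index *label*, not by position. The two coincide for `brown` because its index runs 0, 1, 2, ..., but they differ once rows are filtered. A small sketch on an invented frame shows the difference using `.loc` (labels) and `.iloc` (positions):

```python
import pandas as pd

# Invented data with the same column layout as the brown data frame
df = pd.DataFrame({
    'Count': [300, 120, 45, 30, 12],
    'Word': ['the', 'of', 'and', 'to', 'a'],
})
df['Rank'] = df.index + 1

# Filtering keeps the original index labels (here 3 and 4)
tail = df[df.Rank > 3]

by_label = tail.loc[[3, 4], 'Word'].tolist()      # select by labels 3 and 4
by_position = tail.iloc[[0, 1]]['Word'].tolist()  # select first and second rows

print(by_label, by_position)
```

Both selections return the same two words, but only because labels 3 and 4 happen to be the first and second rows of the filtered frame. Asking the filtered frame for label 0 would raise a `KeyError`.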


In [None]:
# Show how to do a line plot (instead of a scatter plot)
logplot = sns.relplot(
    data= brown, 
    kind="line",
    x="Rank", y="Count"
)
logplot.set(xscale='log', yscale='log')

# Your turn!
Complete the following three questions to reinforce skills, practice abstraction, and warm up for Friday's debugging session.

## Question 1. Reinforce skills 
### Step 1. Read the data in tail_freq.csv into a pandas data frame. Add a rank column.

### Step 2.  Plot count versus rank as a point plot, and adjust axis scales if needed.  Label a few words.

## Question 2. Simplify the following code using a function
The purpose of this exercise is to identify patterns in the tasks the code is trying to achieve, and to shorten the code by defining a function that can be called for repeated tasks.  This will reduce the amount of editing you have to do if there is an error in part of your code.  You may even find yourself saving useful functions to build your personal toolbox!

In [None]:

# Plot of top 100 ranked words
brown100 = brown[brown.Rank <= 100]
logplot = sns.relplot(data= brown100, x="Rank", y="Count", edgecolor='none')
index_values = [0,24,49,74,99]
label_point(brown100.Rank[index_values], brown100.Count[index_values], brown100.Word[index_values], plt.gca())  


In [None]:
# Plot of ranked 101-200 words
brown200 = brown[brown.Rank < 201]
brown200 = brown200[brown200.Rank > 100]
logplot = sns.relplot(data= brown200, x="Rank", y="Count", edgecolor='none')
index_values = [100,124,149,174,199] 
label_point(brown200.Rank[index_values], brown200.Count[index_values], brown200.Word[index_values], plt.gca())

In [None]:
# Plot of ranked 201-300 words
brown300 = brown[brown.Rank < 301]
brown300 = brown300[brown300.Rank > 200]
logplot = sns.relplot(data= brown300, x="Rank", y="Count", edgecolor='none')
index_values = [200,224,249,274,299] 
label_point(brown300.Rank[index_values], brown300.Count[index_values], brown300.Word[index_values], plt.gca())

## Question 3.  What do you think this code might do?  Add a print statement to figure it out.

In [None]:
# Read this program and try to predict what it does
# Run it: how accurate was your prediction?
# Refactor the program to make it more readable.

n = 10
s = 'et cetera et cetera'
print(s)

i = 0
while i < n :
    #print('at', j)
    new = ''
    for j in range(len(s)):
        left = j-1
        right = (j+1)%len(s)
        if s[left]==s[right]: 
            new += "-"
        else: 
            new += "*"
    s=''.join(new)
    i += 1  # shortcut  i = i + 1


Great work today!  Ask on the Slack channel, or contact dataservices@brandeis.edu with questions.