## Do not change the code in the cell below ##

In [None]:
# The pip install can take a minute
%pip install -q urllib3<2.0 datascience ipywidgets
import pyodide_http
pyodide_http.patch_all()

from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

**Note:** In this lecture there is a lot of code. You are not expected to know any of this yet. This is just a preview of the things you will see in the next few weeks. 


## This is a Jupyter Notebook

A Jupyter Notebook is a data-science environment that combines:

1. **Narrative:** The text describing your analysis
2. **Code:** The program that does the analysis
3. **Results:** The output of the program

The Jupyter environment was co-created by Fernando Pérez, a faculty member at the University of California, Berkeley. These ideas have since spread to many other technologies (e.g., Google Colab).


## Our first example: analyzing the text of popular books

We can use the tools of data science to study text. For example, here we will do some basic analysis of *["The Picture of Dorian Gray"](https://en.wikipedia.org/wiki/The_Picture_of_Dorian_Gray)* (by Oscar Wilde) and *["A Tale of Two Cities"](https://en.wikipedia.org/wiki/A_Tale_of_Two_Cities)* (by Charles Dickens).

 

Often the first step in data science is getting the data. The following is a tiny program to download text from the web. More specifically, what you see below is a **function**. We will talk more about functions later on!

In [None]:
# A tiny program to download text from the web.
def read_url(url): 
    from urllib.request import urlopen 
    import re
    return re.sub('\\s+', ' ', urlopen(url).read().decode())
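
To see what the `re.sub('\\s+', ' ', ...)` step inside `read_url` does without touching the network, here is a minimal sketch on a made-up string. The names `raw` and `cleaned` are illustrative and not part of the lecture code; the raw-string pattern `r'\s+'` is equivalent to the `'\\s+'` used above.

```python
import re

# A made-up snippet with the newlines, tabs, and repeated spaces typical of raw text files.
raw = "CHAPTER 1\n\nIt was   a dark\tand stormy night."

# Replace every run of whitespace (spaces, tabs, newlines) with a single space.
cleaned = re.sub(r'\s+', ' ', raw)
print(cleaned)
```

After this step the whole book is one long line of text, which makes later string operations such as splitting and counting simpler.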

Here we download the books, which are actually hosted on the Project Gutenberg website.

In [None]:
dorian_gray_url = 'https://www.gutenberg.org/cache/epub/174/pg174.txt'
dorian_gray_text = read_url(dorian_gray_url)
dorian_gray_chapters = dorian_gray_text.split('CHAPTER ')[21:]

In [None]:
tale_of_two_cities_url = 'https://www.gutenberg.org/cache/epub/98/pg98.txt'
tale_of_two_cities_text = read_url(tale_of_two_cities_url)
tale_of_two_cities_chapters = tale_of_two_cities_text.split('CHAPTER ')[46:]
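
The `split('CHAPTER ')[...]` pattern in the two cells above cuts the text at every occurrence of the marker and then discards the leading pieces. Presumably the discarded pieces are front matter such as a table of contents that also contains the word "CHAPTER"; that is an assumption about the files, not something verified here. A toy sketch with a made-up text (`toy_text`, `pieces`, and `toy_chapters` are hypothetical names):

```python
# A made-up "book": a two-entry table of contents followed by two chapters.
toy_text = "Contents CHAPTER I. CHAPTER II. CHAPTER I. It began. CHAPTER II. It ended."

# Splitting at the marker yields one piece per occurrence,
# plus whatever came before the first occurrence.
pieces = toy_text.split('CHAPTER ')
print(len(pieces))

# Skip the leading piece and the two contents entries to keep only the real chapters.
toy_chapters = pieces[3:]
print(toy_chapters)
```

Under the same assumption, skipping 21 pieces for a 20-chapter book would correspond to one leading piece plus 20 contents entries.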

Let's look at the text from the first chapter of *The Picture of Dorian Gray*:

In [None]:
dorian_gray_chapters[0]

## Tables

- A lot of data science is about transforming data, often in service of producing **tables**, a widely used data structure that makes our data easier to analyze.
- In this class you will use the `datascience` library (created specifically for this course!) to manipulate data.

In [None]:
import datascience
datascience.__version__

In [None]:
from datascience import *

In [None]:
Table().with_column('Chapters', dorian_gray_chapters)

## We will learn to summarize data

We will explore data by extracting summaries. For example, we might ask how often each character appears in each chapter. Short snippets of code can answer such questions.

In [None]:
import numpy as np

In [None]:
np.char.count(dorian_gray_chapters, 'Dorian')

In [None]:
np.char.count(dorian_gray_chapters, 'Henry')

In [None]:
np.char.count(dorian_gray_chapters, 'Basil')
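
To make the counting step concrete, here is a small sketch of `np.char.count` on two made-up "chapters" (the strings and the `toy_chapters` name are illustrative only). It returns one count per element. The match is a case-sensitive substring match, so a name embedded in a longer word would also be counted.

```python
import numpy as np

# Two made-up "chapters"; the notebook passes the list of real chapter strings instead.
toy_chapters = ["Basil met Dorian. Dorian smiled.", "Henry spoke to Basil."]

# Count non-overlapping occurrences of each name in every chapter.
dorian_counts = np.char.count(toy_chapters, 'Dorian')
basil_counts = np.char.count(toy_chapters, 'Basil')

print(dorian_counts)  # one entry per chapter
print(basil_counts)
```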

We can convert the results of our analysis into more tables.

In [None]:
counts = Table().with_columns([
    'Dorian', np.char.count(dorian_gray_chapters, 'Dorian'),
    'Henry', np.char.count(dorian_gray_chapters, 'Henry'),
    'Basil', np.char.count(dorian_gray_chapters, 'Basil'),
])
counts

## We will learn to visualize data

- How many times is each character mentioned in Chapter 1, how many times in Chapters 1 and 2, and so on?
- As we saw above, we could answer this with a table, but there are a lot of chapters!! Let's try something else.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cum_counts_dorian = np.cumsum(counts.column("Dorian"))
cum_counts_henry = np.cumsum(counts.column("Henry"))
cum_counts_basil = np.cumsum(counts.column("Basil"))
cumulative_table = Table().with_columns(
    'Chapter', np.arange(1, 21, 1),
    'Dorian', cum_counts_dorian,
    'Henry', cum_counts_henry,
    'Basil', cum_counts_basil
)
cumulative_table.plot(column_for_xticks='Chapter')
plt.title('Cumulative Number of Times Name Appears')
plt.show()

Here, we have plotted what we call *cumulative counts*.
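
A cumulative count is a running total: the value at chapter *k* is the number of mentions in chapters 1 through *k* combined. A minimal sketch with made-up per-chapter counts (the numbers are not taken from either book):

```python
import numpy as np

# Made-up per-chapter mention counts for one character.
per_chapter = np.array([3, 0, 5, 2])

# Entry i of the cumulative sum is the total over the first i + 1 chapters.
cumulative = np.cumsum(per_chapter)
print(cumulative)  # running totals: 3, 3, 8, 10
```

A flat stretch in a cumulative curve therefore means the character is not mentioned in those chapters, while a steep stretch means many mentions.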

What can we tell from this visualization?  What questions does this raise about the roles of Dorian, Henry and Basil in the book?

In [None]:
# The chapters of A Tale of Two Cities
Table().with_column('Chapters', tale_of_two_cities_chapters)

We can explore the characters in *A Tale of Two Cities* using the same kind of analysis.

In [None]:
# Counts of names in the chapters of A Tale of Two Cities
names = ['Charles', 'Sydney', 'Lucie', 'Alexandre', 'Madame Defarge', 'Monsieur Defarge']
mentions = {name: np.char.count(tale_of_two_cities_chapters, name) for name in names}
counts = Table().with_columns([
        'Charles', mentions['Charles'],
        'Sydney', mentions['Sydney'],
        'Lucie', mentions['Lucie'],
        'Alexandre', mentions['Alexandre'],
        'Madame Defarge', mentions['Madame Defarge'],
        'Monsieur Defarge', mentions['Monsieur Defarge']
    ])

In [None]:
# Plot the cumulative counts
cum_counts_charles = np.cumsum(counts.column("Charles"))
cum_counts_sydney = np.cumsum(counts.column("Sydney"))
cum_counts_lucie = np.cumsum(counts.column("Lucie"))
cum_counts_alexandre = np.cumsum(counts.column("Alexandre"))
cum_counts_madame_defarge = np.cumsum(counts.column("Madame Defarge"))
cum_counts_monsieur_defarge = np.cumsum(counts.column("Monsieur Defarge"))

cumulative_table = Table().with_columns(
    'Chapter', np.arange(1, 46, 1),
    'Charles', cum_counts_charles,
    'Sydney', cum_counts_sydney,
    'Lucie', cum_counts_lucie,
    'Alexandre', cum_counts_alexandre,
    'Madame Defarge', cum_counts_madame_defarge,
    'Monsieur Defarge', cum_counts_monsieur_defarge,
)

cumulative_table.plot(column_for_xticks='Chapter')
plt.title('Cumulative Number of Times Names Appear in A Tale of Two Cities')
plt.show()

We can use interactive tools as well!

In [None]:
# Plot the cumulative counts
Table.interactive_plots()
cumulative_table.plot(column_for_xticks=0)

## Visualizing multiple variables

- How long are the chapters in a book?
- How many sentences are in a chapter? Counting the periods (full stops) in a chapter gives us a rough estimate.

You don't need to worry about understanding the code below for today!!


In [None]:
# Total number of characters in the book (reuses the text downloaded earlier)
len(dorian_gray_text)

In [None]:
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods (full-stops).

length_tpdg = Table().with_columns([
        'Length', [len(s) for s in dorian_gray_chapters],
        'Periods', np.char.count(dorian_gray_chapters, '.')
    ])
length_atotc = Table().with_columns([
        'Length', [len(s) for s in tale_of_two_cities_chapters],
        'Periods', np.char.count(tale_of_two_cities_chapters, '.')
    ])

In [None]:
# The counts for The Picture of Dorian Gray
length_tpdg

In [None]:
# The counts for A Tale of Two Cities
length_atotc

We now have a table for each book with two pieces of information:

- length (number of characters) per chapter
- number of periods per chapter

Next, we might examine how these two variables are related. Below is what is called a **scatter plot** (we will talk about this kind of plot in more depth later on)!

In [None]:
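Table.static_plots()
plt.figure(figsize=(10, 10))
# Each dot is one chapter: x = number of periods, y = number of characters.
# Dark blue: The Picture of Dorian Gray; gold: A Tale of Two Cities.
plt.scatter(length_tpdg.column('Periods'), length_tpdg.column('Length'), color='darkblue')
plt.scatter(length_atotc.column('Periods'), length_atotc.column('Length'), color='gold')
plt.xlabel('Number of periods in chapter')
plt.ylabel('Number of characters in chapter')
plt.title('Relationship between numbers of characters and periods in a chapter');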

This sub-example illustrates the relationship between different facets of our course: namely, the exploration and prediction facets.
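
As a rough sketch of what the prediction facet might look like here, one could fit a straight line to the points of the scatter plot and use it to guess a chapter's length from its period count. The numbers below are made up so that the fit is exact, and `np.polyfit` is just one convenient way to fit a line; the course may well use a different method.

```python
import numpy as np

# Made-up data: every period is accompanied by exactly 100 characters.
periods = np.array([10, 20, 30, 40])
characters = np.array([1000, 2000, 3000, 4000])

# Fit a degree-1 polynomial (a line): characters = slope * periods + intercept.
slope, intercept = np.polyfit(periods, characters, 1)

# Predict the length of a hypothetical chapter containing 25 periods.
predicted_length = slope * 25 + intercept
print(round(float(predicted_length)))
```

With real chapters the points will not fall exactly on a line, so any such prediction is only approximate.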