# What is Data Science?

## Class Profile

Before we start, let's take a quick look at the profile of the students in our class!

First we need some modules.  **Don't worry about the code for now!**

In [None]:
# load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# needs module ipympl
# %matplotlib widget

# without the line below, widget cuts off legends...
plt.rcParams["figure.constrained_layout.use"] = True

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 8)

# style
# plt.style.use('fivethirtyeight')
plt.style.use("ggplot")

Let's load the data and check the first 10 lines.

In [None]:
class_data = pd.read_csv("anonymous_class_data.csv")
class_data.head(10)

Let's see how many students we have from each major:

In [None]:
majors_data = class_data["Major"].value_counts()
majors_data.to_frame()
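
To get a feel for what `value_counts` does, here is a minimal sketch on a tiny made-up Series (the majors below are hypothetical, not taken from `anonymous_class_data.csv`). It counts how often each distinct value occurs and lists the most frequent value first.

```python
import pandas as pd

# hypothetical data, for illustration only
toy_majors = pd.Series(["Math", "Physics", "Math", "Biology", "Math", "Physics"])

# count occurrences of each distinct value, most frequent first
toy_counts = toy_majors.value_counts()
print(toy_counts)
```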

It would be nicer to visualize it with a (bar or pie) graph!

In [None]:
# bar plot
majors_data.plot(kind="barh")

plt.title("Majors in Data 201")
plt.xlabel("Count")

# pie chart
# majors_data.plot(kind="pie", autopct='%1.2f%%')

plt.show()

Now, let's see the class (Freshman, Sophomore, ...) of our students.

In [None]:
classes = ["Freshman", "Sophomore", "Junior", "Senior"]
year_data = class_data["Class"].value_counts().reindex(classes)
year_data.to_frame()
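
Since `value_counts` orders its result by frequency, `reindex` is what puts the classes in their natural order. A minimal sketch with made-up data (the values are hypothetical): any requested label that does not occur in the data shows up as `NaN`.

```python
import pandas as pd

# hypothetical data, for illustration only (no Freshman on purpose)
toy_classes = pd.Series(["Senior", "Junior", "Senior", "Sophomore", "Senior"])

order = ["Freshman", "Sophomore", "Junior", "Senior"]

# reorder the counts to follow `order`; missing labels become NaN
toy_year_counts = toy_classes.value_counts().reindex(order)
print(toy_year_counts)
```

Passing `fill_value=0` to `reindex` would show a zero instead of `NaN` for the missing class.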

In [None]:
# bar plot
year_data.plot(kind="barh", rot=40)

plt.title("Classes in Data 201")
# the bars are horizontal, so the counts are on the x-axis
plt.xlabel("Count")

# pie chart
# year_data.plot(kind="pie", autopct="%1.2f%%")

plt.show()

Let's see how the **top enrollment majors** are divided into classes.

**Note:** This class is too small to pick top majors, so here we will look at *all* majors.

In [None]:
# minimum number of students in major
threshold = 1

# we have no Freshmen!
classes = ["Sophomore", "Junior", "Senior"]

majors = majors_data.loc[majors_data >= threshold].index.to_numpy()

year_major_data = class_data.pivot_table(index="Major", columns="Class", aggfunc=len).fillna(0).loc[majors, :][classes]
year_major_data
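
To make the shape of this table concrete, here is a minimal sketch on made-up data (all rows are hypothetical). It uses `pd.crosstab`, which is a different pandas function from the `pivot_table` call above, chosen here only because it counts the major/class combinations directly.

```python
import pandas as pd

# hypothetical data, for illustration only
toy_data = pd.DataFrame(
    {
        "Major": ["Math", "Math", "Physics", "Biology", "Physics"],
        "Class": ["Junior", "Senior", "Senior", "Sophomore", "Senior"],
    }
)

# rows are majors, columns are classes, entries are student counts
toy_table = pd.crosstab(toy_data["Major"], toy_data["Class"])

# put the columns in class order
toy_table = toy_table[["Sophomore", "Junior", "Senior"]]
print(toy_table)
```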

Now, let's graph it:

In [None]:
year_major_data.plot(kind="barh")

plt.title("Class Breakdown")
# the bars are horizontal, so the counts are on the x-axis
plt.xlabel("Count")

plt.show()

We can also flip it around:

In [None]:
year_major_data.T.plot(kind="barh")

plt.title("Class Breakdown")
# the bars are horizontal, so the counts are on the x-axis
plt.xlabel("Count")

plt.show()

## Introduction

Data Science is about drawing conclusions from large amounts of data, through:
- **Exploration:** identifying patterns.  *Tools:* visualizations and descriptive statistics.
- **Prediction:** making informed guesses.  *Tools:* machine learning and optimization.
- **Inference:** quantifying our degree of certainty.  *Tools:* statistical tests and models.

But Data Science is more than that:

> Data scientists learn to ask appropriate questions about their data and correctly interpret the answers provided by our inferential and computational tools.

We must also learn how to apply these ideas in the real world!

## Why Data Science?

Mainly, to make better decisions:
- Humans have biases.
- Data can validate predictions.
- Methods can be tested and improved.
- Decisions are then based on data and facts.
- Data can be too large or complex for humans to parse.

## Example: Plotting the Classics

We explore statistics from:
1) [*The Adventures of Huckleberry Finn*](https://www.inferentialthinking.com/data/huck_finn.txt) by Mark Twain; and
2) [*Little Women*](https://www.inferentialthinking.com/data/little_women.txt) by Louisa May Alcott. 

***Don't worry about the code (yet)!***  The goal here is to simply see the tool in action.  We will learn the code later.

First, we need to load some Python tools:

In [None]:
# for reading the books from the web
from urllib.request import urlopen
import re

In [None]:
# Read two books, fast!
huck_finn_url = 'https://www.inferentialthinking.com/data/huck_finn.txt'
huck_finn_text = urlopen(huck_finn_url).read().decode()

In [None]:
# print(huck_finn_text)

To make it easier to parse, let's replace every run of whitespace (including new lines and tabs) with a single space:

In [None]:
huck_finn_text = re.sub('\\s+', ' ', huck_finn_text)
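
A quick way to see what this substitution does is to try it on a short string. This is only a sketch on a made-up sample, not on the book text. The pattern `\s+` (written here as a raw string, which is equivalent to the `'\\s+'` above) matches one or more whitespace characters, and each such run is replaced by one space.

```python
import re

# made-up sample with double spaces, a tab, and new lines
sample = "It  was\ta dark\n\nand   stormy night."

# replace each run of whitespace by a single space
cleaned = re.sub(r"\s+", " ", sample)
print(cleaned)
```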

In [None]:
# print(huck_finn_text)

Let's make a function to automate the reading of a URL:

In [None]:
def read_url(url):
    """
    Reads the content of a URL, decodes it, and replaces runs of whitespace
    (including new lines and tabs) with a single space.
    """
    return re.sub("\\s+", " ", urlopen(url).read().decode())

Now, the second book:

In [None]:
little_women_url = 'https://www.inferentialthinking.com/data/little_women.txt'
little_women_text = read_url(little_women_url)

In [None]:
# print(little_women_text)

We now want to divide each book into chapters.  In both books, each chapter starts with `CHAPTER`.  Finding where the first real chapter begins takes some trial and error (or some investigating and counting in the original files).

In [None]:
huck_finn_chapters = huck_finn_text.split("CHAPTER ")[44:]
little_women_chapters = little_women_text.split("CHAPTER ")[1:]
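
The slices are needed because `split` also returns whatever comes before the first separator. A minimal sketch on a made-up string (not the actual book text):

```python
# made-up text: a title followed by two chapters
toy_text = "A Toy Book CHAPTER I. First part. CHAPTER II. Second part."

pieces = toy_text.split("CHAPTER ")
print(pieces)

# drop the first piece (the title), keeping only the chapters
toy_chapters = pieces[1:]
print(len(toy_chapters))
```

If the word `CHAPTER` also appears earlier in a file, for instance in a table of contents, more leading pieces have to be dropped, which is presumably why the slice for *The Adventures of Huckleberry Finn* starts at 44.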

### Characters In The Adventures of Huckleberry Finn

*The Adventures of Huckleberry Finn* describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them towards the end.

Let's create a *data frame* with the number of occurrences of each name, indexed by chapter:

In [None]:
data_dict = {
    "Jim": np.char.count(huck_finn_chapters, "Jim"),
    "Tom": np.char.count(huck_finn_chapters, "Tom"),
    "Huck": np.char.count(huck_finn_chapters, "Huck"),
}

chapters = np.arange(1, len(huck_finn_chapters) + 1)

# use chapter number as index
counts = pd.DataFrame(data=data_dict, index=chapters)
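
As a small illustration of `np.char.count`, here is a sketch on a made-up list of strings standing in for chapters (the text is hypothetical). It returns one count per string.

```python
import numpy as np

# made-up "chapters", for illustration only
toy_chapters = ["Jim saw Tom. Jim waved.", "Tom ran.", "Nobody here."]

# number of non-overlapping occurrences of "Jim" in each string
jim_counts = np.char.count(toy_chapters, "Jim")
print(jim_counts)
```

Note that this counts substrings, not whole words, so `"Tom"` would also be counted inside a word such as `"Tomorrow"`.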

Let's now see the number of occurrences of each character by chapter:

In [None]:
counts

It might be better to *visualize* it!

In [None]:
counts.plot(kind="bar", y=["Jim", "Tom", "Huck"])
plt.title("Occurrence of Name per Chapter")
plt.xlabel("Chapter")
plt.ylabel("Number of Occurrences")
plt.show()

This graph is a bit hard to read...  How about a *cumulative* count?

In [None]:
cum_counts = counts.cumsum()
cum_counts.plot()
plt.title("Cumulative Occurrence of Name")
plt.xlabel("Chapter")
plt.ylabel("Cumulative Number of Occurrences")
plt.show()
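
To see what `cumsum` computes, here is a minimal sketch with made-up per-chapter counts (the numbers are hypothetical). Each row of the result is the total of all rows up to and including that chapter.

```python
import pandas as pd

# made-up per-chapter counts, for illustration only
toy_counts = pd.DataFrame({"Jim": [3, 0, 5], "Tom": [0, 2, 1]}, index=[1, 2, 3])

# running total down each column
toy_cum = toy_counts.cumsum()
print(toy_cum)
```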

You can see:
- Jim is a central character, judging by the large number of times his name appears.
- Tom is hardly mentioned until he arrives and joins Huck and Jim, after Chapter 30. 
- Huck's name hardly appears at all, since he is the narrator.

### Characters of Little Women

*Little Women* is a story of four sisters growing up together during the Civil War.

In [None]:
characters = ["Amy", "Beth", "Jo", "Meg", "Laurie"]
data_dict = {char: np.char.count(little_women_chapters, char) for char in characters}

chapters = np.arange(1, len(little_women_chapters) + 1)

# use chapter number as index
counts = pd.DataFrame(data=data_dict, index=chapters)

Again, let's see the counts:

In [None]:
counts

Let's plot it:

In [None]:
counts.plot(kind="bar", y=characters)
plt.title("Occurrence of Name per Chapter")
plt.xlabel("Chapter")
plt.ylabel("Number of Occurrences")
plt.show()

And the cumulative plot:

In [None]:
cum_counts = counts.cumsum()
cum_counts.plot()
plt.title("Cumulative Occurrence of Name")
plt.xlabel("Chapter")
plt.ylabel("Cumulative Number of Occurrences")
plt.show()

Jo appears the most, as she is the protagonist.  At Chapter 27, she moves to New York alone, so her interactions with her sisters (and hence the occurrences of their names) decrease.

Laurie is a young man who marries one of the girls in the end. See if you can use the plots to guess which one.

### Some More Data

Let's see how long the chapters are in both books, in terms of the number of *printed* characters and the number of sentences.  As a rough proxy for the number of sentences, we simply count the number of periods.

In [None]:
huck_data_dict = {
    "Huck Finn Chapter Length": [len(chapter) for chapter in huck_finn_chapters],
    "Number of Periods": np.char.count(huck_finn_chapters, "."),
}

n_huck_chapters = np.arange(1, len(huck_finn_chapters) + 1)

chars_periods_huck_finn = pd.DataFrame(huck_data_dict, index=n_huck_chapters)

Let's see the counts for the first 10 chapters:

In [None]:
chars_periods_huck_finn.head(10)

Repeating it for *Little Women*:

In [None]:
lw_data_dict = {
    "Little Women Chapter Length": [len(chapter) for chapter in little_women_chapters],
    "Number of Periods": np.char.count(little_women_chapters, "."),
}

n_lw_chapters = np.arange(1, len(little_women_chapters) + 1)

chars_periods_little_women = pd.DataFrame(lw_data_dict, index=n_lw_chapters)

In [None]:
chars_periods_little_women.head(10)

We can also get some statistics:

In [None]:
chars_periods_huck_finn.describe().style.format(precision=2, thousands=",")

In [None]:
chars_periods_little_women.describe().style.format(precision=2, thousands=",")

Clearly, *Little Women* has longer chapters!

Let's visualize the relation between the number of characters and the number of periods in both books simultaneously:

In [None]:
axis = chars_periods_huck_finn.plot(1, 0, kind="scatter", color="darkblue", label="Huck Finn")
chars_periods_little_women.plot(1, 0, kind="scatter", color="gold", label="Little Women", ax=axis)

plt.xlabel("Number of periods in chapter")
plt.ylabel("Number of characters in chapter")

plt.show()

We can see:
- *Little Women* has longer chapters (as already noticed).
- Points for each book seem to be clustered around a line.  (Maybe a single line for both!)
- Where the chapter lengths are similar, the numbers of periods are also similar.

Let's find the actual average length of the sentences for both:

In [None]:
chars_periods_huck_finn["Huck Finn Chapter Length"].sum() / chars_periods_huck_finn[
    "Number of Periods"
].sum()

In [None]:
chars_periods_little_women[
    "Little Women Chapter Length"
].sum() / chars_periods_little_women["Number of Periods"].sum()

So, *The Adventures of Huckleberry Finn* has, on average, longer sentences, but only by two characters (out of about 112)!
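
The averages above divide the total number of characters by the total number of periods. A minimal sketch with made-up numbers (both chapters are hypothetical) shows that this is not the same as averaging the per-chapter ratios, which would give every chapter equal weight regardless of its length.

```python
import pandas as pd

# made-up chapter data, for illustration only
toy = pd.DataFrame({"Chapter Length": [1000, 3000], "Number of Periods": [10, 20]})

# overall average: total characters divided by total periods
overall = toy["Chapter Length"].sum() / toy["Number of Periods"].sum()

# average of the per-chapter averages (every chapter weighs the same)
per_chapter = (toy["Chapter Length"] / toy["Number of Periods"]).mean()

print(overall, per_chapter)
```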

Let's also plot this rough average as a line, to see how the points relate to it.

In [None]:
aver = 112
minx = min(
    chars_periods_huck_finn["Number of Periods"].min(),
    chars_periods_little_women["Number of Periods"].min(),
)
maxx = max(
    chars_periods_huck_finn["Number of Periods"].max(),
    chars_periods_little_women["Number of Periods"].max(),
)

# create figure and axes
fig, ax = plt.subplots()

# average line
plt.plot(
    [minx, maxx],
    [aver * minx, aver * maxx],
    color="red",
    linestyle="dashed",
    linewidth=2,
    label="Average",
)

chars_periods_huck_finn.plot(1, 0, kind="scatter", color="darkblue", label="Huck Finn", ax=ax)
chars_periods_little_women.plot(1, 0, kind="scatter", color="gold", label="Little Women", ax=ax)


plt.xlabel("Number of periods in chapter")
plt.ylabel("Number of characters in chapter")

plt.show()