<a href="https://colab.research.google.com/github/Jonathan-Nyquist/PLAM/blob/main/Class11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a id='Top'> </a>

# Class 11: Advanced Data Analysis

## Learning Objectives:

Once again, the goal of the notebooks in this part is to open your eyes to the possibilities and inspire you to continue to teach yourself Python on your own. Do not worry if you don't understand all of the data analysis techniques that are demonstrated here.

In the previous notebook we looked at some advanced ways of obtaining data. In this notebook we'll look at a few advanced data analysis techniques. In the final notebook we'll look at advanced data presentation.

In this class we will look at:
- [Statistics and Correlation](#Stats)
- [Image Processing](#Images)
- [Natural Language Processing](#Language)
- [Machine Learning](#Machine)
- [Student Challenge 1](#Student1)
- [Student Challenge 2](#Student2)

<a id='Stats'></a>
## Statistics and Correlation
[Top of Notebook](#Top)

Just for fun, and to illustrate the idea of spurious correlations, I asked students in one of my "Evil Plots" classes (another General Education class I created) the following set of random questions:

1.	How many hours a week do you spend on school assignments and studying?
2.	How much do you love math on a scale of 1-10 (1=would rather have my teeth drilled, 10=math problems are better than ice cream and kittens)
3.	What day of the month were you born?
4.	What is your height in inches to the nearest inch?
5.	How many days did you spend at the beach this year?
6.	What is the most miles you’ve driven a car in a single day, ever?
7.	How many songs do you listen to in a day?
8.	How many slices of pizza did you eat in the past month (best estimate).
9.	How many states have you visited in your life? (Driving through or stopping in an airport counts)
10.	How many letters are in your first and last name combined?

***Let's load the data and look for correlations!***

In [None]:
!wget https://github.com/Jonathan-Nyquist/PLAM/raw/main/RandomData.xlsx

In [None]:
import pandas as pd
data =  pd.read_excel('RandomData.xlsx')
data

I learned two things just looking at the data:
1. Some students eat just an incredible amount of pizza!
2. There are always a few who can't follow directions. When asked what day of the month they were born, they gave me the month of the year. So I had to enter NaN (not-a-number) values.

There is no reason to believe any of these data should be correlated. I mean, pizza consumption and letters in your name... Seriously? But let's take a look.

In [None]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

# Extract columns
x = data['PizzaSlices']
y = data['LettersInName']

# Fit a line to the data
# numpy has a function that fits polynomials and a line is just a first degree polynomial.
fit = np.polyfit(x, y, deg=1)
print(fit)

# Create figure, plot line and data
fig, ax = plt.subplots()
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.scatter(x, y)
plt.xlabel('Pizza Slices Consumed Per Month')
plt.ylabel('Combined Number of Letters in Student Name')
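
For the curious: the straight line that `np.polyfit` returns with `deg=1` is an ordinary least-squares fit, and its slope and intercept can be written out by hand. The sketch below uses a handful of made-up numbers (not the class data) to check that the textbook formulas agree with `np.polyfit`.

```python
import numpy as np

# Made-up illustration values, NOT the survey data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares slope: how x and y vary together, divided by how much x varies
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# The best-fit line always passes through the point (mean of x, mean of y)
intercept = y.mean() - slope * x.mean()

# np.polyfit returns coefficients from highest power to lowest: [slope, intercept]
fit = np.polyfit(x, y, deg=1)

print('by hand:', slope, intercept)
print('polyfit:', fit[0], fit[1])
```

That ordering is why the plotting code uses `fit[0] * x + fit[1]` for the red line.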

No surprise here; it is not a very good correlation.  We could plot all the combinations against each other one at a time, but that would be tedious. Pandas to the rescue! (Again, don't worry about a few of the advanced tricks in the code.)

In [None]:
# Create a matrix of scatter plots by looping over all the columns
# We import a package that will do the hard work for us.
from pandas.plotting import scatter_matrix

axes = scatter_matrix(data, diagonal='hist', figsize=(12,12))
corr = data.corr().to_numpy()

# This bit is tricky, but we are looping over all the combinations
for i, j in zip(*np.triu_indices_from(axes, k=1)):
    axes[i, j].annotate("%.3f" %corr[i,j], (0.8, 0.8), xycoords='axes fraction', ha='center', va='center')
plt.show()

1 = perfect positive correlation, 0 = no correlation at all, -1 = perfect negative correlation.
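
These numbers are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. As a rough sketch of where such a number comes from, here is the formula written out with NumPy on made-up values (not the class data) and checked against `np.corrcoef`.

```python
import numpy as np

# Made-up illustration values, NOT the survey data
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Deviations of each value from its own average
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: how much x and y vary together, scaled by how much each varies alone
r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print('by hand:    ', r)
print('np.corrcoef:', np.corrcoef(x, y)[0, 1])
```

Because of the scaling in the denominator, the result always lands between -1 and 1 no matter what units the two columns are measured in.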

Most of the correlation coefficients are very low, as expected.  The strongest correlation is 0.65.  Let's take a closer look at that one.

In [None]:
# Extract columns
x = data['PizzaSlices']
y = data['BeachDays']

# Fit a line to the data
fit = np.polyfit(x, y, deg=1)

# Create figure, plot line and data
fig, ax = plt.subplots()
ax.plot(x, fit[0] * x + fit[1], color='red')
ax.scatter(x, y)
plt.xlabel('Pizza Slices Consumed Per Month')
plt.ylabel('Days Spent At The Beach This Summer')

#data.plot.scatter('PizzaSlices', 'LettersInName')

This is when your brain starts thinking, "Hmmm.  Maybe people who eat a lot of pizza are the type who laze around on the beach all summer." Well, maybe...  More likely, this is just a spurious correlation. If you plot enough things against each other you're bound to find some that correlate.
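
You can convince yourself of this with a quick simulation. The sketch below builds a fake survey the same size as ours (13 students, 10 questions) out of nothing but random numbers, finds the strongest of the 45 pairwise correlations, and repeats the experiment many times. The random seed and the number of repeats are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
n_students, n_questions, n_trials = 13, 10, 200

strongest = []
for _ in range(n_trials):
    # Pure noise: every "answer" is an independent random number
    fake_answers = rng.normal(size=(n_students, n_questions))
    # Correlation matrix between the columns (the fake questions)
    corr = np.corrcoef(fake_answers, rowvar=False)
    # Keep each pair of questions once: the entries above the diagonal
    pair_r = corr[np.triu_indices(n_questions, k=1)]
    strongest.append(np.abs(pair_r).max())

n_pairs = len(pair_r)
print('pairs of questions:', n_pairs)
print('average strongest |r| across trials:', np.mean(strongest))
```

Every column here is noise by construction, yet with so few students and so many pairs, the strongest correlation in a typical run is nowhere near zero.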

### Student Challenge
Here is a really fun book on [spurious correlations](http://tylervigen.com/spurious-correlations). Click on the link and look at a few of the examples. Comment on your favorite in the markdown cell below.

But I digress.  The point was that we just did a pretty sophisticated data analysis of all of the possible pair-wise correlations between ten different variables with only a few lines of Python code.

**Whether you're a scientist analysing experimental results, a business analyst summarizing stock trends, or a sociologist evaluating risk factors for children, being able to power through your data like this gives you superpowers beyond the imagination of puny mortals! With enough Python code you can take over the world.  MU HA HA HA!**

(Sorry. Forgot to take my meds this morning.) 👨🏽‍🔬

<a id='Student1'></a>
## Student Challenge 1
[Top of Notebook](#Top)

Try plotting one of the other combinations, such as love of math versus number of songs listened to per day. Include the regression line.

<a id='Images'></a>
## Image Processing
[Top of Notebook](#Top)

The second form of advanced data analysis we'll explore is image analysis. Here is an image of geologic beds disrupted by faulting and folding (kind of like my bed in the morning).

!['Rolled Beds'](https://pbs.twimg.com/profile_images/229857411/SanAndreas.jpg)

You can see that the sediment bedding, normally horizontal, has been distorted. To extract the bedding layers from the image, we can use a technique called "edge detection." **Let's see if we can enhance the boundaries between the layers.**
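
Before turning to the photo, here is the core idea in miniature: an edge is a place where brightness changes abruptly from one pixel to the next. This sketch uses a tiny made-up "outcrop" of three flat layers and a hypothetical threshold. Real edge detectors such as the Canny filter do a good deal more (smoothing, thinning the edges, and linking weak edges to strong ones), so treat this only as an illustration of the principle.

```python
import numpy as np

# A made-up 6x4 grayscale "outcrop": three flat layers of different brightness
layers = np.array([
    [0.2, 0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2, 0.2],
    [0.8, 0.8, 0.8, 0.8],
    [0.8, 0.8, 0.8, 0.8],
    [0.4, 0.4, 0.4, 0.4],
    [0.4, 0.4, 0.4, 0.4],
])

# Size of the brightness change between each row and the row below it
vertical_change = np.abs(np.diff(layers, axis=0))

# Flag a boundary wherever the change exceeds a (hypothetical) threshold
threshold = 0.3
edge_mask = vertical_change > threshold

# Index i means "there is an edge between row i and row i+1"
edge_rows = np.where(edge_mask.any(axis=1))[0]
print(edge_rows)
```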

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Here we import a new module used for image processing
from skimage import io

# Load the image directly from the web using the URL (web address)
photo = io.imread('https://pbs.twimg.com/profile_images/229857411/SanAndreas.jpg')
plt.figure(figsize=(20,10))
plt.imshow(photo)

In [None]:
from skimage.color import rgb2gray

# Convert the image from color to grayscale
img_gray = rgb2gray(photo)
plt.figure(figsize=(20,10))
imgplot = plt.imshow(img_gray, cmap='gray')

In [None]:
# Adjust the colors by equalizing the image histogram, this increases the contrast
# see: http://scikit-image.org/docs/dev/auto_examples/plot_equalize.html

from skimage import exposure

img_eq = exposure.equalize_hist(img_gray)
plt.figure(figsize=(20,10))

# Plot the image in black & white
imgplot = plt.imshow(img_eq, cmap='gray')
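
What does "equalizing the histogram" actually do? Roughly, each pixel's brightness is replaced by the fraction of pixels that are no brighter than it, which spreads the values out over the full range. The sketch below shows that idea on a tiny made-up image. It is a simplified version of the concept, not a copy of what `exposure.equalize_hist` does internally (the library routine works from a binned histogram).

```python
import numpy as np

# Made-up low-contrast "image": every brightness is squeezed between 0.40 and 0.50
img = np.array([
    [0.40, 0.42, 0.44],
    [0.45, 0.46, 0.47],
    [0.48, 0.49, 0.50],
])

# For each pixel, find the fraction of pixels that are no brighter than it
# (the empirical cumulative distribution of brightness)
flat = img.ravel()
sorted_vals = np.sort(flat)
cdf = np.searchsorted(sorted_vals, flat, side='right') / flat.size

# Use that fraction as the new brightness
equalized = cdf.reshape(img.shape)

print('before: min', img.min(), 'max', img.max())
print('after:  min', equalized.min(), 'max', equalized.max())
```

The ordering of the pixels from dark to light is unchanged; only the spacing between brightness levels is stretched, which is why the contrast goes up.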

In [None]:
from skimage import feature

# Use the Canny filter to find the edges.
# http://scikit-image.org/docs/dev/auto_examples/edges/plot_canny.html#sphx-glr-auto-examples-edges-plot-canny-py
# What this filter does is locate sudden changes from dark to light areas in the image which are typically boundaries

edges = feature.canny(img_eq)
plt.figure(figsize=(20,10))
imgplot = plt.imshow(edges, cmap='gray')

In [None]:
# If we only want to see the most prominent edges we can increase sigma, the width (standard deviation)
# of the Gaussian blur that the Canny filter applies before it looks for edges. More blurring wipes out
# fine detail and noise, so only the broader, stronger boundaries survive.

edges2 = feature.canny(img_eq, sigma=3)
plt.figure(figsize=(20,10))
imgplot = plt.imshow(edges2, cmap='gray')

**Books and books have been written on image processing.  Python has many modules to help.**

<a id='Language'></a>
## Natural Language Processing
[Top of Notebook](#Top)

**A picture is worth a thousand words.** But sometimes you have to work with words, not pictures. Most people know computers can be used to analyze images, but text? Yes, there is a lot of research in this area as well, and it goes way beyond just finding out how many times the word "NASA" appears in a newspaper article. Just as processing images is critical to problems such as computer vision and robotics, teaching a computer to process text is part of the quest for artificial intelligence. Siri on your iPhone is a prime example - voice recognition software feeds into natural language processing for Siri to cater to your every whim. Now, of course, we have ChatGPT, Bard and other generative AI platforms.

Natural language processing is used for many tasks, such as:
* sentiment analysis
* spam filtering
* plagiarism detection
* document categorization
* phrase extraction
* smarter searches
* keyword analysis


**Let's get a glimpse of natural language processing with Python.**

First, we need some text to work with...

In [None]:
quote = '''LOG ENTRY: SOL 381 I’ve been thinking about laws on Mars.

Yeah, I know, it’s a stupid thing to think about, but I have a lot of free time.

There’s an international treaty saying no country can lay claim to anything that’s not on Earth.
And by another treaty, if you’re not in any country’s territory, maritime law applies.

So Mars is “international waters.”

NASA is an American nonmilitary organization, and it owns the Hab. So while I’m in the Hab, American
law applies. As soon as I step outside, I’m in international waters. Then when I get in the rover, I’m
back to American law.

Here’s the cool part: I will eventually go to Schiaparelli and commandeer the Ares 4 lander. Nobody
explicitly gave me permission to do this, and they can’t until I’m aboard Ares 4 and operating the
comm system. After I board Ares 4, before talking to NASA, I will take control of a craft in international
waters without permission.

That makes me a pirate!

A space pirate!'''

### Tokenizing
Let's take the quote above and "tokenize" it using Python's Natural Language Toolkit (NLTK). This breaks the quote into a list of words and punctuation marks.

In [None]:
import nltk
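# Note: newer NLTK releases may ask for a differently named resource (for example 'punkt_tab').
# If you get a "Resource ... not found" error, download the name given in the message.
nltk.download('punkt')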
tokens = nltk.word_tokenize(quote)

In [None]:
print(tokens)
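
To get a feel for what a tokenizer is doing, here is a bare-bones version built from a single regular expression, followed by the "how many times does NASA appear" count mentioned earlier. The text is two sentences adapted from the quote. This is only a sketch: `nltk.word_tokenize` applies many more rules (for contractions, abbreviations, and so on) than this one pattern does.

```python
import re
from collections import Counter

# Two sentences adapted from the quote above
text = ('NASA is an American nonmilitary organization, and it owns the Hab. '
        'After I board Ares 4, before talking to NASA, I will take control of a craft.')

# A token is either a run of letters/digits or a single punctuation mark
simple_tokens = re.findall(r"\w+|[^\w\s]", text)
print(simple_tokens[:8])

# Count how often each token occurs
counts = Counter(simple_tokens)
print('NASA appears', counts['NASA'], 'times')
```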

In [None]:
# Break the quote apart by sentences.
sentences = nltk.tokenize.sent_tokenize(quote)

# Print each sentence with a blank line in between.
for sentence in sentences:
    print(sentence)
    print()

You're probably not impressed. Splitting on sentences just means looking for periods, right? Not really. You have to look for question marks, exclamation points and sentences where the period is inside the quotes.  It's trickier than it sounds.
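
Here is a small demonstration of why it is tricky, using a made-up test sentence and only Python's built-in tools. The abbreviation list is hypothetical and far from complete. NLTK's sentence tokenizer relies on a pre-trained model rather than a hand-written list like this one.

```python
import re

text = 'Dr. Watney grew potatoes. Did it work? Yes! He said, "I am a pirate."'

# Attempt 1: split on periods only
naive = [piece.strip() for piece in text.split('.') if piece.strip()]
print(naive)

# Attempt 2: end a sentence at . ? or ! and keep a closing quote with it
pieces = [p.strip() for p in re.findall(r'[^.!?]+[.!?]+"?', text)]
print(pieces)

# Attempt 3: glue known abbreviations back onto the sentence that follows
ABBREVIATIONS = {'Dr.', 'Mr.', 'Ms.'}   # hypothetical, far from complete
sentences = []
carry = ''
for piece in pieces:
    if piece in ABBREVIATIONS:
        carry += piece + ' '      # hold the abbreviation for the next piece
    else:
        sentences.append(carry + piece)
        carry = ''
print(sentences)
```

The first attempt chops "Dr." off as its own sentence and misses the question mark and exclamation point entirely. The second handles the punctuation and the quote but still trips on "Dr." Only the third gets all four sentences right, and only because we told it about that particular abbreviation.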

But wait, there's more.  NLTK also knows parts of speech. Is a given word a noun, a verb, a preposition?

Here are what some of the tags mean:

[![Tagging.jpg](https://i.postimg.cc/Jhd85ZXH/Tagging.jpg)](https://postimg.cc/p98329qP)

In [None]:
from nltk import pos_tag
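# Note: newer NLTK releases may ask for a differently named tagger resource.
# If you get a "Resource ... not found" error, download the name given in the message.
nltk.download('averaged_perceptron_tagger')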
pos_tag(tokens)

We will not try to go farther than this, but natural language processing is big business and also related to our next topic...

<a id='Machine'></a>
## Machine Learning
[Top of Notebook](#Top)

Every time you shop on Amazon, or browse for movies on Netflix, the site offers recommendations based on your previous orders or browser history. The machine is learning all about you.
&nbsp;

When you search for cute kitten pictures on Google images, how on earth does the computer know which of the millions of images uploaded to the web have cats, let alone which ones are cute? Machine learning again.
&nbsp;

&nbsp;

![Puss](https://s-media-cache-ak0.pinimg.com/originals/d3/e9/fc/d3e9fc222c9bd0d12e0ff126acf7df00.png)
&nbsp;

&nbsp;

How does your email program know to put the latest plea for help from that Nigerian prince with the frozen bank account into your spam folder?  You guessed it. Machine learning.

**Most applications of machine learning fall into one of three categories:**
1. Regression - where you use data to predict something like stock trends.
2. Supervised Classification - where the computer learns to classify stuff based on examples provided, such as images with a cat or no cat.
3. Unsupervised Classification (clustering) - where the computer sorts things into groups it thinks are similar in some way, such as recognizing items that tend to be purchased together.

Python has a machine learning module called scikit-learn that implements many different algorithms for all three types of problems. Machine learning is a huge field, and I'm not an expert (although Temple has a few faculty members who are experts), so we will just look at a simple example to get an idea of how it works.

We will consider an example of unsupervised classification. The basic goal is to sort things into categories based on a list of traits. You want things put in the same category to be more similar to each other than to things in other categories. The categories will be determined as you go along. You probably did something like this as a kid, sorting pebbles into different piles, or organizing your toys. Okay, well your mom probably did the toy thing.

A common algorithm that is used is called [k-means.](https://en.wikipedia.org/wiki/K-means_clustering)

**Let's apply this algorithm to the questions I asked my "Evil Plots" class.**

In [None]:
import pandas as pd
data =  pd.read_excel('RandomData.xlsx')
plt.scatter(data['MathLove'], y=data['HoursStudying'])
plt.xlabel('Love of Math')
plt.ylabel('Hours/week Spent Studying')

There is clearly not much overall correlation between students' love of math and the number of hours per week they report studying. But are there clusters? That is, are there groups of students that tend to answer the same way to both questions? To the human eye there appear to be three clusters: one group of three students in the lower left corner of the graph that hates math and spends little time studying, one group that likes math but also spends little time studying, and one group that feels so-so about math and studies a lot. There is also one oddball who really hates math but studies more than anyone else.

Let's not get carried away interpreting these clusters. We only have a sample size of 13 students. What we want to know is whether or not computers can spot these same clusters.

Because in a typical data set some categories (called attributes) have large numbers (driving miles) and others have small numbers (math love), a common preprocessing step is to scale each attribute to have a similar numeric range so that the attributes with large numbers don't dominate when we combine data. This is called "standardizing the data." We subtract the average so the data will be centered on zero and divide by the standard deviation so the numbers will have similar ranges.

In [None]:
# Standardize the data
data_standardized = (data - data.mean()) / data.std()

plt.scatter(data_standardized['MathLove'], y=data_standardized['HoursStudying'])
plt.xlabel('Standardized Love of Math')
plt.ylabel('Standardized Hours/week Spent Studying')

The plot looks the same, but notice that the x and y axes now have similar ranges.

In [None]:
data_standardized.head()

The same information is plotted, but now each attribute has a mean of zero and a standard deviation of one. The units on the graph are now standard deviations from the mean.
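
If you want to check that claim, here is a small sketch on made-up numbers (not the survey data) that standardizes one attribute and confirms the mean and standard deviation come out as advertised. One detail worth knowing: pandas' `.std()` uses the "sample" standard deviation (`ddof=1`) by default, while NumPy's default is `ddof=0`, so the sketch sets `ddof=1` explicitly to match pandas.

```python
import numpy as np

# Made-up values for one attribute (say, miles driven), NOT the survey data
miles = np.array([120.0, 300.0, 450.0, 80.0, 600.0])

# Subtract the mean, then divide by the sample standard deviation.
# ddof=1 matches what pandas' .std() uses by default.
z = (miles - miles.mean()) / miles.std(ddof=1)

print('standardized values:', np.round(z, 2))
print('mean:', round(z.mean(), 10), ' std:', round(z.std(ddof=1), 10))
```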

Just to be clear, what we are doing is asking the computer to look for clusters within these 13 students based on two attributes: time spent studying and love of math. We'll see if we can divide them into three clusters. The K-means algorithm we'll be using does this by looking for groups where each member is closer to the average of its own group than it is to the average of any other group, which is pretty much what we would use as the definition of a cluster.
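
If you are wondering what happens under the hood, the basic recipe is short: guess some cluster centers, assign every point to its nearest center, move each center to the average of its points, and repeat until nothing changes. The sketch below runs that recipe on made-up points with hand-picked starting centers. It is a bare-bones illustration, not what scikit-learn does line for line. The library chooses starting centers more carefully, can repeat the whole process from several starting guesses (that is what `n_init` controls), and handles awkward cases such as a cluster ending up empty, which this sketch ignores.

```python
import numpy as np

# Made-up 2-D points forming three obvious clumps (NOT the survey data)
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [1.0, 2.0],   # lower left
    [8.0, 8.0], [9.0, 8.0], [8.0, 9.0], [9.0, 9.0],   # upper right
    [1.0, 8.0], [2.0, 9.0], [1.0, 9.0], [2.0, 8.0],   # upper left
])

# Start from three of the points as initial guesses for the cluster centers
# (hand-picked here, one from each clump, purely for illustration)
centers = points[[0, 4, 8]].copy()

for step in range(100):
    # Step 1: distance from every point to every center, then pick the nearest
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Step 2: move each center to the average of the points assigned to it
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])

    # Stop when the centers no longer move
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print('labels: ', labels)
print('centers:\n', centers)
```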

In [None]:
# We will use the sklearn module
from sklearn.cluster import KMeans

cols = ['MathLove', 'HoursStudying']
data4cluster = data_standardized[cols]
kmeans = KMeans(n_init=3, n_clusters=3, random_state=0).fit_predict(data4cluster)

In [None]:
# The three groups are identified as 0, 1 and 2.
data['kmeans']=kmeans
data.head()

In [None]:
# Define colors list, to be used to plot groups: red (=0), green (=1) or blue (=2)
colors=['red','green','blue']
col = []
for ind in kmeans:
    col.append(colors[ind])
plt.scatter(data['MathLove'], data['HoursStudying'], c=col, s=50)
plt.xlabel('Love of Math')
plt.ylabel('Hours per Week Spent Studying')

The computer found the expected clusters. The one oddball in the upper left was put in the group that feels so-so about math and studies a lot, despite having a really strong aversion to math. That is because we told the computer to find exactly three groups, so it had to stick him/her somewhere. This choice of the number of clusters to create is always a question.

What if you were told to divide this bunch into exactly two groups?  How would you cluster them? Let's see what the algorithm does.

In [None]:
# Force the data into two clusters

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit_predict(data4cluster)

colors=['red','green']
col = []
for ind in kmeans:
    col.append(colors[ind])
plt.scatter(data['MathLove'], data['HoursStudying'], c=col, s=50)
plt.xlabel('Love of Math')
plt.ylabel('Hours per Week Spent Studying')

For two clusters you should be able to draw a line that divides the two populations. You can see where the line would go in this case, but the choice of clusters is not unique. Other divisions are possible.
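
Where exactly does that line go? Since K-means assigns each point to the nearer of the two cluster centers, the dividing line is the perpendicular bisector of the segment joining the centers. The sketch below checks this on hypothetical centers and test points (none of these numbers come from the fit above): asking "which side of the line?" gives the same answer as asking "which center is closer?"

```python
import numpy as np

# Hypothetical cluster centers in standardized units (NOT taken from the fit above)
c0 = np.array([-1.0, -0.5])
c1 = np.array([1.0, 0.5])

midpoint = (c0 + c1) / 2
direction = c1 - c0     # points from center 0 toward center 1

def side_of_line(p):
    """Return 1 if p lies on center 1's side of the dividing line, else 0."""
    return int(np.dot(p - midpoint, direction) > 0)

def nearest_center(p):
    """Return the index (0 or 1) of whichever center is closer to p."""
    return int(np.linalg.norm(p - c1) < np.linalg.norm(p - c0))

test_points = np.array([[-2.0, 1.0], [0.5, 0.5], [0.2, -1.0], [2.0, 2.0]])
for p in test_points:
    print(p, 'line says', side_of_line(p), '| nearest center says', nearest_center(p))
```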

### Take-home Message
This example demonstrated how you can use a computer algorithm to divide people into different categories based on measured attributes. We did a very simple example with only two attributes, math and studying, but the computer could just as easily have used hundreds of factors. It could have been a lengthy questionnaire about your preferences in an ideal mate as part of a dating service, your political views to target you for campaign contributions, or your web browsing history to bombard you with advertisements people in "your cluster" are most likely to click on.

Many more sophisticated algorithms are continually sifting through your data without human guidance. As you sign away more and more of your personal data in exchange for convenient, free applications, the machines are reading your social media posts, looking at your photos on SnapChat, reviewing your fitness data, monitoring your GPS location, and tracking your browsing history --- and learning all your hopes, fears, and deep, dark secrets. If you listen closely, somewhere in the endless binary stream of ones and zeros you can hear the machine's whispered laughter.

![TwitterBot](http://imgs.xkcd.com/comics/twitter_bot.png)

<a id='Student2'></a>
## Student Challenge 2
[Top of Notebook](#Top)

See what happens if you divide the data into four clusters. You will need to add another color to the list to plot the results. [Here is a list](http://matplotlib.org/api/colors_api.html) of the colors that are predefined in matplotlib.