[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MattFinney/practical_data_science_in_python/blob/main/Session_1_Introduction_to_Data_Science.ipynb)

# Introduction to Data Science
## Core Concepts and Exploratory Data Analysis
Authors: Paulina Toro Isaza, Matthew Finney


---

## Agenda

1.   What is Data Science?
2.   Concepts and Methods
3.   Deep Dive: Music Recommendation
4.   Demo: Exploratory Data Analysis in Python

---

## What is Data Science?

![image.png](https://imgur.com/maii0F6.png)


Source: [A New Venn Diagram by Pierluigi Casale](https://open-dse.github.io/new-venn-diagram/)



### The Data Science Process

![data_iceberg.png](https://imgur.com/pxivm3x.png)

1. Formulate your Question
2. Get, Clean, and Transform the Data
3. Explore the Data
4. Model the Data
5. Communicate the Results

### Applications

*   Why does a video or meme go viral?
*   How can we better identify and predict disease?
*   Recommendation systems for music, movies, and other products
*   Language translation

---

## Concepts and Methods


### Concepts

#### Predicting vs. Explaining

**Predicting**: Estimate some value based on input data.

> What will be the box office opening for a film?

**Explaining**: Determine how factors affect a particular outcome.

> What factors influence the box office opening of a film?  

---

#### Supervised vs. Unsupervised

**Supervised**: We know the "true" values of our output data.

> What is the genre of a book?

In the supervised case, each book will already be assigned a particular genre such as mystery, romance, fantasy, or memoir. Our model will be taught to predict the genre and our predictions will be compared to the "true" genre.

**Unsupervised**: We don't know the "true" values of our output data.

> What groupings differentiate between types of books best?

In the unsupervised case, our books are not already labelled with a genre. Instead, our model will try to create its own labels based on whatever characteristics we pass in. Our results might mimic traditional genres or might show some unexpected groupings that challenge our notion of traditional genres!

--- 

#### Regression vs. Classification

**Regression**: The response variable for what we are trying to predict or explain is numerical. 

> How many views do we expect a particular YouTube video to have based on its title, thumbnail, subscriber count for the channel, and view count of the previous video on the channel?

**Classification**: The response variable for what we are trying to predict or explain is categorical. 

> How can we predict that a YouTube video is not suitable for children under 13 based on title and thumbnail?
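To make the distinction concrete, here is a minimal sketch using scikit-learn. The subscriber counts, view counts, and "viral" labels are made up for illustration, and the choice of `LinearRegression` and `LogisticRegression` is just one possible pairing of models, not a recommendation from this session.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: subscriber counts (in thousands) for six hypothetical channels
subscribers = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression: the response is a number (views, in thousands)
views = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
regressor = LinearRegression().fit(subscribers, views)
predicted_views = regressor.predict(np.array([[7.0]]))  # a numeric estimate

# Classification: the response is a category (0 = not viral, 1 = viral)
went_viral = np.array([0, 0, 0, 1, 1, 1])
classifier = LogisticRegression().fit(subscribers, went_viral)
predicted_label = classifier.predict(np.array([[7.0]]))  # one of the known categories

print(predicted_views, predicted_label)
```

The regression model returns a value on a continuous scale, while the classifier can only return one of the categories it saw during training.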


---

### Methods

![methods.png](https://imgur.com/7mlirpP.png)

---

## Deep Dive: Music Recommendation

### Context

What happens when you reach the end of a playlist on platforms like Spotify or YouTube? Do you find yourself skipping the next song or did the platform give you a good follow-up to what you were listening to?

### Goal: How can we use data science methods to create playlists of similar songs?

#### Methods

**Unsupervised learning**: Our dataset does not contain any information about playlists or genres, so the data is unlabeled. We don't know the "true" values of the genre or playlist.

**Dimension Reduction**: Our dataset contains about a dozen audio features used to describe a single song. Because it's hard to visualize information in more than three dimensions, we might want to reduce the number of dimensions in some way. 

**Classification**: This might sound like a classification problem because we want to classify songs as belonging to different playlists (genres). However, we usually talk about classification methods as those done in supervised learning.

**Clustering**: We'll be using this machine learning algorithm to partition the dataset into a pre-defined number of groups. The partitions do not rely on knowing the "true" values of the groups.
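As a rough illustration of partitioning into a pre-defined number of groups, the sketch below runs k-means (the technique named for the next session) on six made-up songs described by two invented feature values. The data, the variable names, and the choice of two clusters are all assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up audio features (energy, acousticness) for six hypothetical songs
songs = np.array([
    [0.90, 0.10], [0.85, 0.15], [0.95, 0.05],  # high energy, low acousticness
    [0.20, 0.90], [0.15, 0.85], [0.25, 0.95],  # low energy, high acousticness
])

# We choose the number of groups up front; no genre labels are supplied
kmeans = KMeans(n_clusters=2, n_init=10, random_state=123)
playlist_ids = kmeans.fit_predict(songs)

print(playlist_ids)  # one cluster number per song
```

The cluster numbers themselves are arbitrary. What matters is which songs end up sharing a number.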



---

## Demo: Exploratory Data Analysis in Python

#### Importing Libraries

We need to tell Python what kind of functionality we want for our code. Instead of loading every possible feature up front, Python lets us *import* particular libraries so that we only load what we need.

There are many libraries, such as those for working with websites, audio, or social networks. But since we are going to be working with data, we only want to import data-related libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set1')
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


**Pandas**: Open source data analysis and manipulation tool for Python. Industry standard.

**NumPy (Numerical Python)**: Open source tool for working with arrays, matrices, linear algebra, and other numerical operations in Python.

**Matplotlib**: Open source data visualization tool for creating plots, graphs, and other visuals. Typically used for more basic visuals.

**Seaborn**: Another open source data visualization tool in Python. Built on top of matplotlib, it offers higher-level statistical plot types and comparatively simpler syntax.

### Pre-Set Functions and Variables

No need to worry about understanding this code! Just be sure to run it before running anything else. 

In [None]:
# Data file path
url = 'https://drive.google.com/file/d/1KSqsf0s6_GuumpLdLP7owLk2AMJA-Qyl/view?usp=sharing'
file_path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]

# Relevant columns
audio_feature_cols = ['danceability', 'energy', 'key', 'loudness', 'mode', 
                      'speechiness', 'acousticness', 'instrumentalness',
                      'liveness', 'valence', 'tempo', 'duration_ms',
                      'time_signature']
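For the curious, the URL handling in the cell above can be unpacked step by step. This sketch repeats the same string operations with intermediate names (`share_url`, `file_id`, `download_url`) that are introduced here purely for illustration.

```python
# Same share link as in the cell above
share_url = 'https://drive.google.com/file/d/1KSqsf0s6_GuumpLdLP7owLk2AMJA-Qyl/view?usp=sharing'

# Splitting on '/' yields [..., 'd', '<file id>', 'view?usp=sharing'],
# so the second-to-last piece is the file ID
file_id = share_url.split('/')[-2]

# Append the ID to the direct-download prefix used above
download_url = 'https://drive.google.com/uc?export=download&id=' + file_id

print(file_id)
print(download_url)
```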

### Load the Data

Pandas lets you easily read in a CSV file as a **DataFrame**, the primary data structure for pandas. It is essentially a table where each record is a row with different features as columns.

In [None]:
tracks_df = __.________(file_path)
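To see how CSV text maps onto rows and columns, here is a small sketch that reads a made-up, three-row CSV from an in-memory string instead of a file. The track names and values are invented, and `example_df` is a hypothetical name unrelated to the workshop dataset.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (values are illustrative only)
csv_text = """track_name,danceability,tempo
Song A,0.71,120.0
Song B,0.45,98.5
Song C,0.88,128.0
"""

# read_csv accepts a local path, a URL, or a file-like object such as StringIO
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df.shape)  # (number of rows, number of columns)
print(example_df)
```

The first line of the CSV becomes the column names, and every following line becomes one row.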

### Display Head and Summary

After reading in the data, we want to make sure we've read it in correctly.

This is most easily done by displaying the head, or top, of the DataFrame.

In [None]:
tracks_df.______

Another way to get a good, quick overview of the data is to look at the summary description. This can give us an idea of whether or not the range of values for numerical data makes sense or how much data is missing for a particular column.

In [None]:
tracks_df.__________
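The sketch below shows what these two overviews look like on a tiny, made-up table with one missing tempo value. The data and the names `example_df`, `first_rows`, and `summary` are assumptions for illustration only.

```python
import pandas as pd

# Made-up table with one missing tempo value
example_df = pd.DataFrame({
    'track_name': ['Song A', 'Song B', 'Song C', 'Song D'],
    'danceability': [0.71, 0.45, 0.88, 0.60],
    'tempo': [120.0, 98.5, 128.0, None],
})

# First two rows only
first_rows = example_df.head(2)

# Summary statistics for the numeric columns
summary = example_df.describe()

print(first_rows)
print(summary)
```

In the summary, the `count` row reports non-missing values, so a count lower than the number of rows (here 3 of 4 for `tempo`) signals missing data.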

### Display Data Types

It's also helpful to make sure the columns were read in as the correct data types as some functions and analyses can only be done with certain types.

In [None]:
tracks_df._____

* **float64**: Decimal number
* **int64**: Integer (whole) number
* **object**: String with alphanumeric characters, not treated as a number
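One situation where checking types pays off is a numeric column that was read in as text. The sketch below uses a made-up table (the column `tempo_text` is hypothetical) and converts it with `astype`. Note that some newer pandas versions may report text columns with a dedicated string type rather than `object`.

```python
import pandas as pd

# Made-up table; 'tempo_text' holds numbers stored as text
example_df = pd.DataFrame({
    'danceability': [0.71, 0.45],
    'key': [5, 11],
    'track_name': ['Song A', 'Song B'],
    'tempo_text': ['120.0', '98.5'],
})

print(example_df.dtypes)

# Convert the text column to decimals so numeric operations work
example_df['tempo'] = example_df['tempo_text'].astype(float)

print(example_df['tempo'].mean())
```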

### Small Note about Cleaning

Data will often be in a format that's hard to work with or will have errors.  A large portion of data science work involves cleaning or restructuring data for proper use. 

In this workshop, we will be using data that has already been cleaned and structured in the way we want it. 
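For a sense of what cleaning can involve, here is a minimal sketch on a made-up table containing a duplicated row and a missing value. Which cleaning steps are appropriate depends on the dataset; dropping rows is only one option among several.

```python
import pandas as pd

# Made-up messy data: 'Song A' appears twice and 'Song B' has no tempo
messy_df = pd.DataFrame({
    'track_name': ['Song A', 'Song A', 'Song B', 'Song C'],
    'tempo': [120.0, 120.0, None, 98.5],
})

clean_df = (
    messy_df
    .drop_duplicates()           # remove the repeated 'Song A' row
    .dropna(subset=['tempo'])    # drop rows with no tempo recorded
    .reset_index(drop=True)      # renumber the remaining rows from 0
)

print(clean_df)
```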


### High Level Exploratory Data Analysis Charts



#### Distribution Plots

Distribution plots are an easy way to quickly learn about a numerical feature. The visualization gives us an idea of what the most common values are, how spread out they are, what the complete range is, and which values might be outliers.
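Under the hood, a histogram is just a count of how many values fall into each bin. The sketch below computes those counts directly with NumPy on randomly generated numbers; the values are synthetic and are not drawn from the workshop dataset.

```python
import numpy as np

# Synthetic scores centered around 0.6 (not real track data)
rng = np.random.default_rng(123)
values = rng.normal(loc=0.6, scale=0.15, size=500)

# Count how many values fall into each of 10 equal-width bins
counts, bin_edges = np.histogram(values, bins=10)

# The tallest bar marks the most common range of values
peak = counts.argmax()
print('Most common range:', bin_edges[peak], 'to', bin_edges[peak + 1])
print('Full range:', values.min(), 'to', values.max())
```

A plotted histogram draws one bar per bin with height equal to its count, which is why the shape reveals the common values, the spread, and the extremes at a glance.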


##### Danceability

In [None]:
# matplotlib histogram / distribution plot for danceability
plt.___(tracks_df['__'])

# Title and labels
plt._____('Histogram of Track Danceability') #plot title
plt.______('danceability') #x-axis label
plt.______('tracks') #y-axis label

Takeaways

##### Acousticness

In [None]:
# matplotlib histogram / distribution plot for acousticness
plt.____(tracks_df['__'])

# Title and labels
plt._____('Histogram of Track Acousticness') #plot title
plt.______('acousticness') #x-axis label
plt.______('tracks') #y-axis label

Takeaways

##### Tempo

In [None]:
# matplotlib histogram / distribution plot for tempo
plt.____(tracks_df['__'])

# Title and labels
plt._____('Histogram of Track Tempo') #plot title
plt.______('tempo') #x-axis label
plt.______('tracks') #y-axis label

Takeaways

#### Correlation

Scatterplots of two variables can help us visualize the relationship between them. If two variables are highly correlated (or related), we might consider only using one of them in our model. 

In [None]:
# matplotlib scatter plot of energy vs. loudness
plt._______(tracks_df['__'], tracks_df['__'], alpha = 0.7)

# Title and labels
plt._____("Scatterplot of Energy and Loudness") #plot title
plt.______("Energy") #x-axis label
plt.______("Loudness") #y-axis label
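The strength of a relationship like this can also be summarized as a single number, the correlation coefficient, which ranges from -1 to 1. The sketch below computes it with pandas on synthetic data that is constructed to be strongly related; the numbers are invented and do not come from the workshop dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: loudness is built to rise with energy, plus some noise
rng = np.random.default_rng(123)
energy = rng.uniform(0, 1, size=200)
loudness = -20 + 15 * energy + rng.normal(0, 1.5, size=200)

example_df = pd.DataFrame({'energy': energy, 'loudness': loudness})

# Pearson correlation between the two columns
corr = example_df['energy'].corr(example_df['loudness'])
print(corr)
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little or none.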

### Dimensionality Reduction
In the graph above, we were able to visualize the relationship between two variables, energy and loudness. However, our dataset has thirteen variables containing audio information. That's a lot of data to try to visualize at once. Thus, a common tactic is to reduce the *dimension* of the data in some way. 

#### Principal Components Analysis
The first method we're going to try is called Principal Components Analysis (PCA). 

Because we are using functionality that is not part of base Python, we need to import additional libraries.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

##### PCA Function
We've already written a function that performs PCA and plots the first two principal components, since it's easiest for people to visualize data in two or three dimensions. Run the cell below to define it before plotting.

In [None]:
# Helper function for running PCA and plotting the first two components

rs = 123
np.random.seed(rs)

def pca_plot(df, classes=None):

    # Scale data for PCA
    scaled_df = StandardScaler().fit_transform(df)
    
    # Fit the PCA and extract the first two components
    pca_results = PCA().fit_transform(scaled_df)
    pca1_scores = pca_results[:,0]
    pca2_scores = pca_results[:,1]
    
    # Sort the legend labels
    if classes is None:
      hue_order = None
      n_classes = 0
    elif str(classes[0]).isnumeric():
      classes = ['Cluster {}'.format(x) for x in classes]
      hue_order = sorted(classes)
      n_classes = np.max(np.unique(classes).shape)
    else:
      hue_order = sorted(classes)
      n_classes = np.max(np.unique(classes).shape)

    # Plot the first two principal components
    plt.figure(figsize=(8.5,8.5))
    plt.grid()
    sns.scatterplot(x=pca1_scores, y=pca2_scores, s=50, hue=classes,
                    hue_order=hue_order)
    plt.xlabel("Principal Component {}".format(1))
    plt.ylabel("Principal Component {}".format(2))
    plt.title('Principal Component Plot')
    plt.show()
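To see what the scaling and PCA steps inside the function produce, here is a minimal sketch without any plotting. It uses synthetic data with four made-up features, two of which are deliberately related. Unlike `pca_plot`, which fits all components and keeps the first two, this sketch asks for two components directly via `n_components=2`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 rows, 4 features; the first two share a common signal
rng = np.random.default_rng(123)
base = rng.normal(size=(100, 1))
features = np.hstack([
    base + rng.normal(scale=0.1, size=(100, 1)),
    2 * base + rng.normal(scale=0.1, size=(100, 1)),
    rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# Put every feature on the same scale so none dominates
scaled = StandardScaler().fit_transform(features)

# Reduce four features to two principal components
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print(scores.shape)                   # one (PC1, PC2) pair per row
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```

The explained variance ratio gives a rough sense of how much information is retained when the data is squeezed down to two dimensions.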

##### Principal Component Plot

In [None]:
pca_plot(_______)

Takeaways

## Next Session
Session 2: Practical Data Science in Python  
Thursday, October 1 | 1:30pm - 2:45pm PDT  
https://careermingle2020.sched.com/event/dtq3

Our next session will be a live coding demo continuing our Spotify case study. We will:
* Introduce a common unsupervised learning technique, k-means clustering, to group songs by similarity
* Create visuals using k-means clustering to see the groupings
* Test our results!