# My First Data Science Notebook

## *Welcome Feinberg Medical Students!!!*

## Purpose

This interactive Colab notebook will help you become acquainted with the POWER of data science notebooks to retrieve files from... *anywhere*, preview file contents (e.g., just the first few rows of large tables of data), manipulate data, create neat visualizations, analyze data, and then apply machine learning to develop models and predict events! Cool, huh??? 

You might wonder which computer is doing this for you? Your own? NO... there is a processor far off in the Internet that, through the benefits of cloud computing, has been allocated to YOU and your work HERE, right now! Your personal computer is merely displaying content that this other processor will manipulate at YOUR direction! **YOU**... are in charge! 

But wait... you don't know how to program? No worries! This notebook will enable you to look behind the scenes and understand what the programs are doing... while not having to code anything yourself!

## How to use a notebook!

Of course, we mean how to use a *data science* notebook. This notebook is called a Colab notebook (very similar to another version, a Jupyter Notebook). Notebooks are assembled from cells and cells contain:

>  - text (like this that you're reading now)
>  - code (commands your cloud computer will execute for you)

### Navigation

How should we move up and down the page to explore these cells? Click once to select a cell and then:  

>  **Use the arrow keys on your keyboard.**

Try it out! Go up and down the page and come back here! Navigating this way means you always know which cell is active: the one you highlighted with the arrow keys! Once you're finished with a cell (and this one ends here), press the down arrow to move to the next cell.


### Editing

This cell (the one you're reading now) is written with simple formatting called *markdown*. If you double-click inside this cell or hit *enter* (the return key), the screen will split and you'll see the *raw* text, with its formatting codes revealed, next to the polished text.

Try it - hit enter! Ok... now you can edit or delete any text! Don't delete everything... but also don't worry; this is your own view on your cloud computer. You're not changing anything for anyone else. Ummm... how do I go back and forth from editing to viewing? 

> **Use enter:** To enter the edit mode.  
> **Use esc:** To return to the viewing mode.


### How do I use the interactive stuff???

Some cells are coding cells. You'll recognize them since they have confusing symbols in them (like # or words like matplotlib). To execute (which means to run) the code in those cells there are two methods: 

> 1. **Click the play symbol** that appears next to that cell.
> 2. **Hold the control key and then press the enter key**. (Control-enter)




## Our Agenda

Now that you understand why we are here and how to navigate a data science notebook, we can get down to business. We will:  

1. Set up your cloud computer with the right programs so it can perform your analysis. You don't need to understand the details; we will simply help you tell your computer which programs it needs to manipulate and visualize data! (This is the Library Import section.)
2. Retrieve the data set you are going to explore! 
3. View some of the data so you know what you're working with!
4. Visualize different aspects of your data set.
5. Select a variable to predict. 
6. Choose a portion of your dataset to generate your predictive model. 
7. Validate your model against the fresh component of your dataset to see how well your model works!

# Prepare our cloud computer!

## Library import

This is where we tell our cloud computer which programs it needs. Just 2 things to know here:

> 1. **The # symbol**: Pay attention to where the **#** symbol appears in code cells like the one below. The text following each # is a *comment* that explains the code so you can follow what's going on! 
> 2. **Control-enter** or click the play symbol: Remember - use **control-enter** to run the code! While the cell is running, the **[  ]** before the cell briefly changes to a "stop" symbol; when the cell finishes, it switches to a number, and a green checkmark appears to the left. (Sometimes it's so fast, you can't see it; other times it may take a minute to process.) The number indicates the order in which the cell was executed on the page. 

In [None]:
# Be sure to run this critical first cell to get our cloud computer ready! Remember, click the "play" button to the left or control-enter.
# If you haven't logged into Google, you'll be prompted. Also... (and I say this with care...) Ignore the warning prompt if you see one and proceed.
# This will then load the right programs so we can analyze our data!

import pandas as pd
import numpy as np
import seaborn as sns

# Options for pandas to display well here
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualization tools
import plotly
import plotly.graph_objs as go
import plotly.offline as ply
plotly.offline.init_notebook_mode(connected=True)

import cufflinks as cf
cf.go_offline(connected=True)
cf.set_config_file(theme='white')

# The plotting functions live in matplotlib's pyplot module, conventionally imported as plt
import matplotlib.pyplot as plt

# Autoreload extension
if 'autoreload' not in get_ipython().extension_manager.loaded:
    %load_ext autoreload
    
%autoreload 2




# Data import

This notebook uses a deidentified dataset that the Cleveland Clinic kindly made available. We'll retrieve it here and explore it!

In [None]:
# Run this cell to reference the website that is holding the data in a CSV (comma separated values) file! 

website = "https://docs.google.com/spreadsheets/d/e/2PACX-1vThOgP7ulRBYiAVrEWbQCad6Mh2Z1ceFDP-G8xPRUpry9st8Xppv0LpUW1X8nW0U-uaPqZ1KRgJ6--k/pub?gid=1217486054&single=true&output=csv"

# Then the command below, which also runs when you run this cell, will load the data into something called a pandas dataframe. 
# Pandas is the program (think of Excel), and a dataframe is a powerful table, like an Excel spreadsheet.
# We will name the dataframe "heart_disease". 

heart_disease = pd.read_csv(website)
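
If you're curious what reading a CSV file actually does, here is a minimal sketch that does not need the Internet at all. The few rows of text below are made up for illustration (they are not taken from the Cleveland Clinic file), and the name `toy` is just a placeholder. The same `pd.read_csv` command turns that text into a dataframe.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (values are illustrative only).
csv_text = """Age,Cholesterol,BP,Max HR
63,233,145,150
41,204,130,172
57,354,120,163
"""

# pd.read_csv accepts a web address, a file path, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.head(2))  # only the first 2 rows
```

The first line of the text becomes the column names, and every later line becomes one row of the table.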

# Data processing
 

First we'll view just the first few rows of the table to see what we're dealing with!

In [None]:
# Let's look at just the first few rows of data in our file. We can do this with the ".head" command. 
# And... No worries here! You don't need to memorize these commands. These explanations are simply so you can follow along!

heart_disease.head()

Now, let's figure out what type of data is reflected in the table.

In [None]:
# The ".dtypes" command will show us the data types for each column in our dataframe. (Remember a dataframe is like an Excel spreadsheet.)

heart_disease.dtypes

There are 14 columns. To start with, let's make this smaller with just 4 columns so we can explore how the variables relate to each other. Your cloud computer can analyze all of these at once, but it's easier for us humans to learn if we start with a smaller set. Let's pick age, cholesterol, systolic BP, and maximum heart rate. 

In [None]:
# The command below will make a new dataframe that contains only the columns specified. 
# Then, we'll be able to analyze how these values relate to each other.

hd2 = heart_disease[['Age', 'Cholesterol', 'BP', 'Max HR']]
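
Why the double square brackets? Here is a small sketch on a made-up table (the values and the extra `Sex` column are assumptions for illustration only). The inner brackets hold a list of column names, and the outer brackets ask the dataframe for those columns.

```python
import pandas as pd

# Made-up values for illustration only; the column names mirror the ones used above.
toy = pd.DataFrame({
    "Age": [63, 41, 57],
    "Sex": [1, 0, 1],
    "Cholesterol": [233, 204, 354],
    "BP": [145, 130, 120],
    "Max HR": [150, 172, 163],
})

# The outer brackets index the dataframe; the inner brackets are a Python list of column names.
subset = toy[["Age", "Cholesterol", "BP", "Max HR"]]

print(subset.columns.tolist())  # the 4 selected columns, in the order requested
print(toy.shape, subset.shape)  # the original table still has all of its columns
```

Selecting columns this way gives you a new, smaller table and leaves the original untouched.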

Now, let's visualize how they are related to each other.

In [None]:
# The command below hints at the power of what is possible for you! That is, in almost a flash,
# your cloud computer will create a 4-by-4 grid of graphs that pairs up the 4 variables in our dataframe 
# AND also perform a linear regression to reveal potential relationships among the variables.

g2 = sns.pairplot(hd2, kind = 'reg')


# Once you have run this cell, you should see 16 graphs: a bivariate (2 at a time) analysis of each pairing of the 4 variables,
# with a histogram of each variable along the diagonal. See the discussion beneath the graphs.

### Exploration of 4 variables

There are a couple of quick observations:  

1. The histograms (along the diagonal) plot values against frequency. So, for age, cholesterol, systolic BP, and max heart rate we see most values in the middle ranges and then a few outliers at both ends. This makes sense!
2. Then, the linear regression lines in the bivariate analyses have some interesting findings: cholesterol tended to rise with age, BP tended to rise with cholesterol, and max HR most clearly seemed to diminish with age. These all seem to make sense, too! (Of course... regardless of whether they "make sense," these are the findings from linear regression against this Cleveland Clinic dataset.)
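
What does a regression line actually tell us? Here is a minimal sketch using four made-up patients (these numbers are invented for illustration and are not from the dataset). It uses NumPy's `np.polyfit` to fit a straight line; this shows the same idea as the lines in the graphs, although the graphing library may calculate them differently behind the scenes.

```python
import numpy as np

# Made-up ages and maximum heart rates, for illustration only.
age = np.array([40, 50, 60, 70])
max_hr = np.array([180, 172, 158, 150])

# Fit a straight line: max_hr is approximately slope * age + intercept.
slope, intercept = np.polyfit(age, max_hr, 1)

# Pearson correlation: -1 is a perfect falling line, +1 is a perfect rising line.
r = np.corrcoef(age, max_hr)[0, 1]

print(f"slope = {slope:.2f} beats per minute per year of age")
print(f"intercept = {intercept:.1f}, correlation r = {r:.2f}")
```

A negative slope means the line falls from left to right, which is the pattern described above for max HR and age. The correlation tells you how tightly the points hug that line.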

# References and more

I loved this so much, I want to make my own Jupyter notebooks! How do I do this???

Helpful references:
1. 
2. 