# Intro to Data Science

[Gina Sprint](https://ginasprint.com/)

# Introduction
What are our learning objectives for this lesson?
* Understand the general field of data science
* Run a Python program on your own computer
    * Interactive mode
    * Scripting mode

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm up Task(s)
Welcome to class! I'm really glad to be here with you all!
1. Go to our Moodle site
    * Note: you should have received an invite email to Moodle; please let me know if you need me to resend it
1. In chat, please introduce yourself with your preferred name and something interesting about you 😀
1. Make a [Github](https://github.com/) account

## Today
1. Introductions
1. What is Data Science?
1. Git/Github
1. Review of Python

## TODO
1. List practice problems below (not graded)
1. Work on Quiz 1 in Moodle

### 1D List Practice Problem
In ListFun, write code that generates 20 random numbers between 1 and 10 inclusive and puts them in a 1D list. The program then does the following using the list:
* Prints the numbers all on one line, each number separated by a space
* Sorts the list using a list method
* Prints the largest and smallest number in the list
    * Hint: can you take advantage of the current ordering of your list?
* Determines the number of times a user-specified number is in the list 
* Removes all instances of a user-specified number in the list. If the number is not in the list print the message: "Sorry, your number is not here!"

Note: for practice with functions, try solving this problem using functions :)

### 2D List Practice Problem
In ListFun, write code that generates 50 random numbers between 1 and 10 inclusive and puts them in a 2D list that is 10x5 (i.e. 10 rows and 5 columns). The program then does the following using the list:
* Prints the numbers in a nice grid format (like a table)
* Prints the largest and smallest number in the list
* Determines the number of times a user-specified number is in the list 
* Removes all instances of a user-specified number in the list. If the number is not in the list print the message: "Sorry, your number is not here!"

Note: for practice with functions, try solving this problem using functions :)

## What is Data Science?
Data science is the science of analyzing data to gain insight, draw conclusions, or make decisions about the data. 

What are examples of data in the real-world and how is that data being analyzed?
* Medical data collected from electronic health records, physician/nurse notes, etc.
    * Analyzed to determine health risk factors, onset of early disease, insurance billing, etc.
* Time series data collected from sensors installed in the environment or worn on the body (wearables)
    * Analyzed to detect physical activity, daily behavior, changes in behavior over time, etc.
* Social media data collected from social networks, posting, news feeds, etc.
    * Analyzed to suggest friends, deliver user-specific content, recommend products, target advertising, etc.
* Financial data collected from banking transactions, trading, etc.
    * Analyzed to project stock market trends, recommend certain investments, determine credit scores, etc.
* Many others

What do data scientists do? Data scientists spend a surprising amount of time preparing data for analysis. In fact, a survey found that cleaning big data is the most time-consuming and least enjoyable task data scientists do!
<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg" width="700">

(image from [https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg))

Some topics related to data science that we will cover in this class (at a high level) include the following:
* [Data representation/cleaning/munging/wrangling](https://en.wikipedia.org/wiki/Data_wrangling): Describes the overall process of manipulating unstructured and/or messy data into a structured and clean form.
* [Data mining](https://en.wikipedia.org/wiki/Data_mining): The computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning): Provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data.
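
To make the data wrangling idea concrete, here is a minimal sketch (not from the original notes) that turns a few messy comma-separated lines into clean, structured records. The sample lines, the field names, and the helper `clean_row` are all hypothetical and chosen only for illustration.

```python
# Hypothetical messy records: inconsistent spacing, casing, and a missing value.
raw_rows = [
    " 59, M ,4",
    "19,f,5",
    "44 ,m, ",
]


def clean_row(line):
    """Split a comma-separated line and normalize each field."""
    age_text, gender_text, impressions_text = [
        field.strip() for field in line.split(",")
    ]
    return {
        "age": int(age_text),
        "gender": gender_text.upper(),
        # An empty field becomes None so the gap stays visible instead of
        # silently turning into 0.
        "impressions": int(impressions_text) if impressions_text else None,
    }


clean_rows = [clean_row(line) for line in raw_rows]
for row in clean_rows:
    print(row)
```

Real datasets are usually messier than this, so treat the sketch as a starting point rather than a general-purpose cleaner.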

## Python
In this class, we are going to learn and use the Python programming language for all of our coding assignments. According to [IEEE Spectrum](http://spectrum.ieee.org/computing/software/the-2016-top-programming-languages), Python was one of the top three programming languages of 2016, and according to [KDNuggets](http://www.kdnuggets.com/2014/08/four-main-languages-analytics-data-mining-data-science.html), Python is one of the top two programming languages for analytics, data mining, and data science (second only to R).

### Why Use Programming for Data Science?
* Faster than analyzing by hand (especially for large data)!
* Reusable: rerun the same program on other data, or on the same data with different params/settings
* Enables a form of “repeatability” (and ideally, transparency)
    * can repeat “experiment” and get the same result
    * no “magic” steps
* Still important, however, to write down steps (log)
    * ideally, someone should be able to take your data, program, and description of steps, rerun everything, and get the same results!
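
As a small illustration of repeatability, the sketch below (not part of the original notes) fixes a random seed so that rerunning the "experiment" produces the same result. The function name `run_experiment` and the seed value 42 are illustrative assumptions, not anything prescribed by the course.

```python
import random


def run_experiment(seed):
    """Draw five random integers from 1 to 10 using a private, seeded generator."""
    rng = random.Random(seed)
    return [rng.randint(1, 10) for _ in range(5)]


first_run = run_experiment(seed=42)
second_run = run_experiment(seed=42)

print(first_run)
print(first_run == second_run)  # True: same seed, same sequence
```

Recording the seed alongside your data and steps is one simple way to avoid a "magic" step that nobody else can reproduce.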

### Why Use Python for Data Science?
Advantages of learning Python include:

1. Easy to learn
1. Free, open source
1. Support for the life cycle of software (prototyping, development, testing, release, maintenance)
1. Many available libraries, especially for data analytics:
    1. [numpy](http://www.numpy.org/)
    1. [scipy](https://www.scipy.org/)
    1. [sci-kits](https://scikits.appspot.com/) (especially [sci-kit learn](http://scikit-learn.org/stable/) for machine learning)
    1. [pandas](http://pandas.pydata.org/)
    1. [Plotting libraries](https://wiki.python.org/moin/NumericAndScientific/Plotting), such as [matplotlib](http://matplotlib.org/) and [Plotly](https://plot.ly/)
1. Many supported GUI backends
1. LOTS of community support/development online
1. Cross platform support
    * Python is an interpreted language, which means it can run on any system with the Python interpreter installed. The trade-off is that Python code can be slower to run than code written in compiled languages like C
    
### Python Distribution and IDE
We will use the [Anaconda v3.9](https://www.anaconda.com/products/individual) Python 3 distribution. This is a free distribution of Python version 3 available for Windows, OS X, and Linux. You can download Anaconda3 [here](https://www.anaconda.com/products/individual) and view the installation instructions [here](https://docs.anaconda.com/anaconda/install/).

Anaconda comes packaged with an easy-to-use integrated development environment (IDE) called [Spyder](http://spyder-ide.org/) (Scientific Python Development Environment) and (optionally) [Visual Studio Code](https://code.visualstudio.com/). I encourage you to use Spyder or Visual Studio Code (or one of the [Anaconda-supported IDEs](https://support.anaconda.com/customer/en/portal/articles/2880333-using-an-ide-with-anaconda)) to develop your Python code.

## Datasets
Our focus is “Tabular” Data ... aka Relational or Structured
* Data is organized into tables (rows and columns)

Age |Gender |Impressions |Clicks |SignedIn
-|-|-|-|-|
59 |1 |4 |0 |1
19 |0 |5 |0 |1
44 |1 |5 |0 |1
28 |1 |4 |0 |1
61 |1 |10 |1 |1
0 |0 |3 |1 |0

* You are already familiar with tabular data! Data in an Excel spreadsheet is structured in tables

<img src="https://stablemanagement.com/.image/t_share/MTQ1MDY3NjE3NzUzMzc2MDMy/stable-management-board-paid-worksheet.jpg" width="400">

(image from https://stablemanagement.com/.image/t_share/MTQ1MDY3NjE3NzUzMzc2MDMy/stable-management-board-paid-worksheet.jpg)

* Each row is an "instance"
    * aka "example", "record", or "object"
* Each column is an “attribute” (of the instance)
    * aka "variables" or "fields"
* A "dataset" is a (sample) set of instances
    * from the "universe of objects" (universe of instances)

This is a sample of (simulated) daily website click stream data (Example from "Doing Data Science", Schutt and O’Neil)
* Each row contains attribute values for one user
* The attributes are the user’s age, gender (0=female, 1=male), number of ads shown (impressions), number of ads clicked, and whether the user was logged in (0=no, 1=yes)
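
One simple way to hold the table above in Python is a 2D list, with one inner list per instance (row) and one position per attribute (column). This is only a sketch: the variable names (`header`, `table`, `gender_labels`) are illustrative, and later lessons may use other representations.

```python
# Column names (attributes), in the same order as the values in each row.
header = ["Age", "Gender", "Impressions", "Clicks", "SignedIn"]

# Each inner list is one instance (one user's row from the table above).
table = [
    [59, 1, 4, 0, 1],
    [19, 0, 5, 0, 1],
    [44, 1, 5, 0, 1],
    [28, 1, 4, 0, 1],
    [61, 1, 10, 1, 1],
    [0, 0, 3, 1, 0],
]

# A row is an instance.
first_instance = table[0]

# A column is an attribute: collect the same position from every row.
clicks_index = header.index("Clicks")
clicks_column = [row[clicks_index] for row in table]

# Decode the gender codes using the mapping stated in the notes.
gender_labels = {0: "female", 1: "male"}
gender_index = header.index("Gender")
genders = [gender_labels[row[gender_index]] for row in table]

print(first_instance)      # [59, 1, 4, 0, 1]
print(clicks_column)       # [0, 0, 0, 0, 1, 1]
print(sum(clicks_column))  # 2
print(genders[:2])         # ['male', 'female']
```

Looking up the column position with `header.index` keeps the code readable and avoids hard-coding column numbers.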