# [CPSC 222](https://github.com/GonzagaCPSC222) Intro to Data Science
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Introduction
What are our learning objectives for this lesson?
* Understand the general field of data science

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm up Task(s)
Welcome to class! I'm really glad you are here.
* Task #1: If you are going to use a lab machine, please log into it now
* Task #2: Make sure you can access Canvas: https://canvas.gonzaga.edu
* Task #3: Meet your neighbor and see how they are doing 😀

## Today
1. Class is being Zoom recorded
1. Attendance/ice breaker -- with prizes!! 🎖️
1. Class resources overview
1. BREAK
1. What is Data Science?
1. TODO: Before next class
    1. Make a [GitHub](https://github.com/) account, install [Git](https://git-scm.com/downloads), install [Anaconda Python Distribution](https://www.anaconda.com/products/individual), and install [VS Code](https://code.visualstudio.com/)
    1. See [U0 Introduction Lesson B Environment Setup](https://github.com/GonzagaCPSC222/U0-Introduction/blob/master/B%20Environment%20Setup.ipynb) for details on how to do this
    1. Please complete the welcome questionnaire in Canvas

## What is Data Science?
Data science studies how to use data to solve problems and answer research questions. It encompasses all parts of a data-intensive workflow, from the beginning (e.g., data collection, data preparation), through the middle (e.g., data mining, supervised/unsupervised machine learning), to the end (e.g., presenting insights and/or deploying a software system). It helps to think about data science as science, meaning it uses the scientific method we are all familiar with (1. observe/question, 2. research, 3. form a hypothesis, 4. test with experiment, 5. analyze data, 6. report conclusions), but with a focus on data:
1. Identify: pick something you are curious about and collect data about it
1. Understand: get familiar with your data and its "bigger picture"
1. Process: prepare your data for analyses 
1. Analyze: look closely at the data to discover previously unknown patterns, trends, associations, groups, etc
1. Conclude: draw valid conclusions and discuss potential action items
1. Communicate: share your knowledge with your targeted audience

What are examples of data in the real-world and how is that data being analyzed?
* Medical data collected from electronic health records, physician/nurse notes, etc.
    * Analyzed to determine health risk factors, onset of early disease, insurance billing, etc.
* Time series data collected from sensors installed in the environment or worn on the body (wearables)
    * Analyzed to detect physical activity, daily behavior, changes in behavior over time, etc.
* Social media data collected from social networks, posting, news feeds, etc.
    * Analyzed to suggest friends, deliver user-specific content, recommend products, target advertising, etc.
* Financial data collected from banking transactions, trading, etc.
    * Analyzed to project stock market trends, recommend certain investments, determine credit scores, etc.
* Many others

What do data scientists do? Data scientists spend a surprising amount of time preparing data for analysis. In fact, one survey found that cleaning big data is the most time-consuming and least enjoyable task data scientists do!
<img src="https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg" width="700">

(image from [https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg](https://thumbor.forbes.com/thumbor/960x0/https%3A%2F%2Fblogs-images.forbes.com%2Fgilpress%2Ffiles%2F2016%2F03%2FTime-1200x511.jpg))

Some topics related to data science that we will cover in this class (at a high level) include the following:
* [Data collection](https://en.wikipedia.org/wiki/Data_collection): Designing data collection protocols and executing the protocols to collect data to be used for analysis.
* [Data representation/cleaning/munging/wrangling](https://en.wikipedia.org/wiki/Data_wrangling): Describes the overall process of manipulating unstructured and/or messy data into a structured and clean form.
* [Data mining](https://en.wikipedia.org/wiki/Data_mining): The computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning): Provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data.

## Python
In this class, we are going to learn and use the Python programming language for all of our coding assignments. [IEEE Spectrum](https://spectrum.ieee.org/top-programming-languages-2022) ranks Python as the top programming language, and according to [KDnuggets](https://www.kdnuggets.com/2020/01/python-preferred-languages-data-science.html), Python is the most popular programming language for analytics, data mining, and data science (followed by R).

### Why Use Programming for Data Science?
* Faster than analyzing by hand (especially for large data)!
* Reusable: rerun the same program on other data, or on the same data with different parameters/settings
* Enables a form of “repeatability” (and ideally, transparency)
    * can repeat “experiment” and get the same result
    * no “magic” steps
* Still important, however, to write down steps (log)
    * ideally, someone should be able to take your data, program, and description of steps, rerun everything, and get the same results!
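
The repeatability idea above can be sketched in a few lines using only the Python standard library. The function name `run_experiment`, the seed values, and the sample sizes are hypothetical choices made for illustration. The point is that fixing the random seed and writing down the settings lets anyone rerun the program and get the same result.

```python
import random
import statistics


def run_experiment(seed, sample_size=100):
    """Draw a pseudo-random sample and return its mean.

    `seed` and `sample_size` are the "settings" someone would need to
    log in order to reproduce the result (hypothetical example).
    """
    rng = random.Random(seed)  # a private generator, so no hidden global state
    sample = [rng.gauss(0, 1) for _ in range(sample_size)]
    return statistics.mean(sample)


# Same program + same settings -> same result, every time.
first_run = run_experiment(seed=222)
second_run = run_experiment(seed=222)
print(first_run == second_run)  # True

# Same program, different settings: easy to rerun without redoing work by hand.
other_run = run_experiment(seed=7, sample_size=1000)
print(first_run, other_run)
```

Logging the seed and sample size alongside the output is what removes the "magic" steps: the numbers can be regenerated from the program and the log alone.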

### Why Use Python for Data Science?
Advantages of learning Python include:
1. Easy to learn
1. Free, open source
1. Support for the life cycle of software (prototyping, development, testing, release, maintenance)
1. Many available libraries, especially for data analytics:
    1. [numpy](http://www.numpy.org/)
    1. [scipy](https://www.scipy.org/)
    1. [SciKits](https://scikits.appspot.com/) (especially [scikit-learn](http://scikit-learn.org/stable/) for machine learning)
    1. [pandas](http://pandas.pydata.org/)
    1. [Plotting libraries](https://wiki.python.org/moin/NumericAndScientific/Plotting), such as [matplotlib](http://matplotlib.org/) and [Plotly](https://plot.ly/)
1. Many supported GUI backends
1. LOTS of community support/development online
1. Cross platform support
    * Python is an interpreted language, so the same code can run on any system with the Python interpreter installed. The tradeoff is speed: Python code can be slow to run compared with compiled languages like C
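
As a rough illustration of the speed tradeoff, the sketch below times a hand-written Python loop against the built-in `sum`, which in the standard CPython interpreter is implemented in compiled C. This is an assumption-laden comparison, not a benchmark: the helper name `python_loop_sum` and the list size are hypothetical choices, and the exact timings will vary by machine.

```python
import timeit


def python_loop_sum(values):
    """Add up the values one at a time in interpreted Python code."""
    total = 0
    for value in values:
        total += value
    return total


values = list(range(100_000))

# Both approaches compute the same answer.
print(python_loop_sum(values) == sum(values))  # True

# Time 20 repetitions of each; exact numbers vary by machine.
loop_seconds = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_seconds = timeit.timeit(lambda: sum(values), number=20)
print(f"Python loop:  {loop_seconds:.4f} s")
print(f"built-in sum: {builtin_seconds:.4f} s")
```

This is also part of why libraries like numpy matter for data science: they let Python code hand the heavy numerical work to compiled code.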
    
### Python Distribution and IDE
We will use the [Anaconda 3](https://www.anaconda.com/products/individual) Python 3 distribution. This is a free distribution of Python version 3 available for Windows, macOS, and Linux. You can download Anaconda 3 [here](https://www.anaconda.com/products/individual) and view the installation instructions [here](https://docs.anaconda.com/anaconda/install/).

Anaconda comes packaged with an easy-to-use integrated development environment (IDE) called [Spyder](http://spyder-ide.org/) (Scientific Python Development Environment) and (optionally) [Visual Studio Code](https://code.visualstudio.com/). I encourage you to use Spyder or Visual Studio Code (or one of the other [Anaconda-supported IDEs](https://support.anaconda.com/customer/en/portal/articles/2880333-using-an-ide-with-anaconda)) to develop your Python code.

## Datasets
Our focus is “tabular” data (aka relational or structured data)
* Data is organized into tables (rows and columns)

Age |Gender |Impressions |Clicks |SignedIn
-|-|-|-|-|
59 |1 |4 |0 |1
19 |0 |5 |0 |1
44 |1 |5 |0 |1
28 |1 |4 |0 |1
61 |1 |10 |1 |1
0 |0 |3 |1 |0

* You are already familiar with tabular data! Data in an Excel spreadsheet is structured in tables

<img src="https://www.excel-easy.com/examples/images/online/new-sheet-view.png" width="400">

(image from https://www.excel-easy.com/examples/images/online/new-sheet-view.png)

* Each row is an "instance"
    * aka "example", "record", or "object"
* Each column is an “attribute” (of the instance)
    * aka "variable" or "field"
* A "dataset" is a (sample) set of instances
    * from the "universe of objects" (universe of instances)
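
One simple way to picture these terms in Python is to store the sample table from above as a header plus a list of rows. This is only a sketch using built-in lists; the names `header`, `table`, and `get_column` are hypothetical choices for illustration, not a required representation.

```python
# The sample table from above, stored as a header plus a list of rows.
header = ["Age", "Gender", "Impressions", "Clicks", "SignedIn"]
table = [
    [59, 1, 4, 0, 1],
    [19, 0, 5, 0, 1],
    [44, 1, 5, 0, 1],
    [28, 1, 4, 0, 1],
    [61, 1, 10, 1, 1],
    [0, 0, 3, 1, 0],
]

# Each row is one instance.
first_instance = table[0]
print(first_instance)  # [59, 1, 4, 0, 1]


def get_column(table, header, attribute_name):
    """Collect one attribute's value from every instance."""
    col_index = header.index(attribute_name)
    return [row[col_index] for row in table]


# Each column is one attribute.
ages = get_column(table, header, "Age")
print(ages)  # [59, 19, 44, 28, 61, 0]

# The dataset is the whole collection of instances.
print(len(table), "instances,", len(header), "attributes")  # 6 instances, 5 attributes
```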

This is a sample of (simulated) daily website clickstream data (example from "Doing Data Science", Schutt and O’Neil).
* Each row contains attribute values for one user
* The attributes are the user’s age, gender (0=female, 1=male), ads shown (Impressions), ads clicked (Clicks), and whether the user was signed in (0=no, 1=yes)
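
To make the encoding concrete, here is a small sketch that reads the six sample rows with the standard library's `csv` module, decodes the 0/1 codes into labels, and totals the ads shown and clicked. The CSV text, the label dictionaries, and the click-through rate summary are illustrative assumptions added here; they are not part of the original example.

```python
import csv
import io

# The six sample rows from the table above, written as CSV text.
# (In practice the data would be read from a file instead.)
csv_text = """Age,Gender,Impressions,Clicks,SignedIn
59,1,4,0,1
19,0,5,0,1
44,1,5,0,1
28,1,4,0,1
61,1,10,1,1
0,0,3,1,0
"""

# Read each row into a dictionary keyed by attribute name, converting
# the text values to integers.
users = [
    {name: int(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(csv_text))
]

# Decode the numeric codes into readable labels.
GENDER_LABELS = {0: "female", 1: "male"}
SIGNED_IN_LABELS = {0: "no", 1: "yes"}

for user in users:
    print(user["Age"], GENDER_LABELS[user["Gender"]], SIGNED_IN_LABELS[user["SignedIn"]])

# Summarize: total ads shown, total ads clicked, and clicks per ad shown.
total_impressions = sum(user["Impressions"] for user in users)
total_clicks = sum(user["Clicks"] for user in users)
click_through_rate = total_clicks / total_impressions
print(total_impressions, total_clicks)  # 31 2
print(round(click_through_rate, 3))  # 0.065
```

Notice the last row: an age of 0 for a user who is not signed in is the kind of value worth questioning when getting familiar with a dataset, before drawing any conclusions from it.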