
# [CPSC 322]() Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/) |
[Sophina Luitel](https://www.gonzaga.edu/school-of-engineering-applied-science/faculty/detail/sophina-luitel-phd-0dba6a9d)

---

# Introduction
What are our learning objectives for this lesson?
* Understand the general field of data science
  

Content used in this lesson is based upon information in the following sources:
* Adapted from Dr. Gina Sprint's Data Science Algorithm Course, Fall 2024

## Warm up Task(s)
* Welcome to class! I'm really glad you are here
* Task #1: Make sure you can access Canvas: https://canvas.gonzaga.edu
* Task #2: Meet your neighbor and see how they are doing 😀

## Today
1. Introduction
1. Class resources overview
1. What is Data Science?
1. TODO: 
    1. Before next class: Install [Git](https://git-scm.com/downloads), install [Docker Desktop](https://www.docker.com/products/docker-desktop), and install [VS Code](https://code.visualstudio.com/download)
        1. See [U0 Introduction Lesson B Environment Setup]() for details on how to do this
    1. Start working on your first assignment, Mini Assignment #1 (MA1) (Learn/Brush up on your Python): Uploaded in Canvas 
        1. We will go over Git/GitHub and Docker next class
    1. Please complete the welcome questionnaire in Canvas

## What is Data Science?

Data science studies how to use data to solve problems and answer research questions. It combines statistics, computer science and domain knowledge.

It encompasses all parts of a data-intensive workflow, from the beginning (e.g., data collection, data preparation), through the middle (e.g., data mining, supervised/unsupervised machine learning) to the end (e.g., presenting insights and/or deploying a software system).
<img src="https://raw.githubusercontent.com/DataScienceAlgorithms/M1_Introduction/main/figures/datascience.png" alt= "Data Science" style="display:block; margin-left:auto; margin-right:auto;" width="300" height="300" />
Image from: [Data Science Advisory | Shelly Palmer](https://media.shellypalmer.com/wp-content/images/2015/08/data-science-venn-600px2-compressor-600x619.png)


As you can see in the image above, becoming proficient in data science requires a mix of hard skills and soft skills. A strong foundation in **statistics** and **mathematics** is essential for analyzing and visualizing data. At the core lies **machine learning**, the engine that powers predictions and intelligent systems. To apply machine learning effectively, you also need coding skills, from implementing algorithms to handling data preprocessing, model training, and evaluation. Equally important is a solid grasp of the **domain** you are working in, so you can frame business problems or research questions accurately.

But technical expertise alone isn’t enough. Once you’ve drawn conclusions or built models, you must be able to communicate your results clearly to stakeholders. Good communication ensures that your work has real-world impact.

In short, data science covers the entire data-driven workflow—from the beginning (data collection and preparation), through the middle (data mining, supervised and unsupervised learning), all the way to the end (presenting insights or deploying systems).

## Data Science Workflow

* Data Collection: Gathering the information you need from surveys, databases, sensors, or other sources. (This is where you get your raw data.)
* Data Wrangling (Cleaning): Fixing messy data by handling missing values, formatting issues, and duplicates so it’s ready to use.
* Data Exploration: Looking at the data closely to spot patterns, trends, or unusual points. (Sometimes called “exploratory data analysis” or EDA.)
* Modeling: Using statistical models or machine learning algorithms to make predictions or understand relationships in the data.
* Visualization & Communication: Presenting your findings clearly using charts, dashboards, or reports, so others can understand and act on your results.
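
The stages above can be sketched end to end in a few lines of plain Python. This is a toy illustration, not a real pipeline: the sensor-style readings, the function names, and the "predict the mean" model are all made up for this example.

```python
# Toy walk-through of the workflow stages. All data and names here are
# hypothetical and exist only for illustration.

def collect():
    """Data collection: pretend these strings came from a sensor log."""
    return ["72.5", "68.0", "", "75.5", "68.0", "n/a", "70.0"]

def wrangle(raw):
    """Data wrangling: drop missing/invalid entries and convert to floats."""
    cleaned = []
    for value in raw:
        try:
            cleaned.append(float(value))
        except ValueError:
            continue  # skip entries such as "" and "n/a"
    return cleaned

def explore(values):
    """Data exploration: compute simple summary statistics."""
    return {"count": len(values), "min": min(values), "max": max(values)}

def model(values):
    """Modeling: a deliberately trivial 'model' that predicts the mean."""
    return sum(values) / len(values)

def communicate(summary, prediction):
    """Communication: turn the results into a readable sentence."""
    return (f"{summary['count']} valid readings between {summary['min']} "
            f"and {summary['max']}; predicted next reading: {prediction:.1f}")

raw_readings = collect()
readings = wrangle(raw_readings)
summary = explore(readings)
prediction = model(readings)
report = communicate(summary, prediction)
print(report)
```

In a real project each function would be far more involved, but the shape stays the same: each stage consumes the output of the previous one.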

**What are examples of data in the real-world and how is that data being analyzed?**
* Medical data collected from electronic health records, physician/nurse notes, etc.
    * Analyzed to determine health risk factors, onset of early disease, insurance billing, etc.
* Time series data collected from sensors installed in the environment or worn on the body (wearables)
    * Analyzed to detect physical activity, daily behavior, changes in behavior over time, etc.
* Social media data collected from social networks, posting, news feeds, etc.
    * Analyzed to suggest friends, deliver user-specific content, recommend products, target advertising, etc.
* Financial data collected from banking transactions, trading, etc.
    * Analyzed to project stock market trends, recommend certain investments, determine credit scores, etc.
* Many others

## What do Data Scientists do? 

Data scientists spend a surprising amount of time preparing data for analysis. According to a survey of data scientists conducted by Anaconda, about 45% of their time goes to data preparation tasks, including loading and cleaning data ([Big Data Wire](https://www.bigdatawire.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/)).

<img src="https://raw.githubusercontent.com/DataScienceAlgorithms/M1_Introduction/main/figures/ds.png" style="display:block; margin-left:auto; margin-right:auto;" width="300" height="300" />

Image from: [Big Data Wire.](https://www.bigdatawire.com/2020/07/06/data-prep-still-dominates-data-scientists-time-survey-finds/)

Some topics related to data science that we will cover in this class (at a high level) include the following:
* [Data collection](https://en.wikipedia.org/wiki/Data_collection): Designing data collection protocols and executing the protocols to collect data to be used for analysis.
* [Data representation/cleaning/munging/wrangling](https://en.wikipedia.org/wiki/Data_wrangling): Describes the overall process of manipulating unstructured and/or messy data into a structured and clean form.
* [Data mining](https://en.wikipedia.org/wiki/Data_mining): The computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
* [Machine learning](https://en.wikipedia.org/wiki/Machine_learning): Provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can change when exposed to new data.

## What Kind of Data do Data Scientists Analyze?
**Structured Data**
* Structured data is organized in a clear, predefined format.
* Structured data can include both quantitative data (such as prices or revenue figures) and qualitative data (such as dates, names, addresses and credit card numbers). 
* Structured data is typically stored in tabular formats, such as Excel spreadsheets and [relational databases](https://www.ibm.com/topics/relational-databases) (or SQL databases).
---
**Unstructured Data**
* Unstructured data does not have a predefined format. 
* Unstructured data can contain both textual and nontextual data and both qualitative (social media comments) and quantitative (figures embedded in text) data.
---

* Our focus is "Tabular" Data ... aka Relational or Structured
    * Data is organized into tables (rows and columns)

Age |Gender |Impressions |Clicks |SignedIn
-|-|-|-|-|
59 |1 |4 |0 |1
19 |0 |5 |0 |1
44 |1 |5 |0 |1
28 |1 |4 |0 |1
61 |1 |10 |1 |1
0 |0 |3 |1 |0

* You are already familiar with tabular data! Data in an Excel spreadsheet is structured in tables
  
<img src="https://www.excel-easy.com/examples/images/online/new-sheet-view.png" width="400">

(image from https://www.excel-easy.com/examples/images/online/new-sheet-view.png)
* Each row is an "instance"
    * aka "example", "record", or "object"
* Each column is an "attribute" (of the instance)
    * aka "variables" or "fields"
* A "dataset" is a (sample) set of instances
    * from the "universe of objects" (universe of instances)
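
The terminology above maps directly onto code. As a minimal sketch, assuming the table is stored as a plain Python list of lists (one of several reasonable representations), the sample table from this section could look like the following. The names `header`, `table`, and `get_column` are chosen for this example only.

```python
# The sample table from above, stored as a header plus a list of rows.
header = ["Age", "Gender", "Impressions", "Clicks", "SignedIn"]
table = [
    [59, 1, 4, 0, 1],
    [19, 0, 5, 0, 1],
    [44, 1, 5, 0, 1],
    [28, 1, 4, 0, 1],
    [61, 1, 10, 1, 1],
    [0, 0, 3, 1, 0],
]

def get_column(table, header, name):
    """Return all values of one attribute (column) as a list."""
    index = header.index(name)
    return [row[index] for row in table]

first_instance = table[0]                  # one row = one instance
ages = get_column(table, header, "Age")    # one column = one attribute

print(first_instance)   # [59, 1, 4, 0, 1]
print(ages)             # [59, 19, 44, 28, 61, 0]
print(len(table), "instances,", len(header), "attributes")
```

Here the dataset is the whole `table`: a sample of six instances, each described by five attributes.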


## Types of Attributes
* **Qualitative Attributes:** These attributes represent categories and do not have a meaningful numeric interpretation. Examples include gender, color, or product type. 
    * Nominal: Categories without an inherent order (e.g., colors, education status).
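    * Ordinal: Categories with a meaningful order but undefined intervals (e.g., medal rankings in a competition, such as gold, silver, and bronze).
* **Quantitative Attributes:** These attributes represent numeric values for which arithmetic (e.g., sums, differences, averages) is meaningful.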
    * Discrete: Countable values, often integers (e.g., number of children).
    * Continuous: Any value within a range, including fractions (e.g., height, temperature).
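
One way to see the difference between these attribute types is to look at which operations make sense for each. The sketch below uses made-up values, and the `MEDAL_ORDER` mapping is an assumption introduced here to encode the ordinal ranking explicitly.

```python
from collections import Counter

# Hypothetical values for each attribute type.
colors = ["red", "blue", "red", "green"]     # nominal
medals = ["silver", "gold", "bronze"]        # ordinal
num_children = [0, 2, 1, 3]                  # discrete
heights_cm = [162.5, 180.2, 175.0]           # continuous

# Nominal: counting categories is meaningful; ordering or averaging is not.
color_counts = Counter(colors)

# Ordinal: sorting is meaningful once the order is made explicit,
# but the "distance" between gold and silver is still undefined.
MEDAL_ORDER = {"gold": 1, "silver": 2, "bronze": 3}
medals_sorted = sorted(medals, key=MEDAL_ORDER.get)

# Quantitative: arithmetic such as sums and means is meaningful.
total_children = sum(num_children)
mean_height = sum(heights_cm) / len(heights_cm)

print(color_counts.most_common(1))            # [('red', 2)]
print(medals_sorted)                          # ['gold', 'silver', 'bronze']
print(total_children, round(mean_height, 1))  # 6 172.6
```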


## Python
In this class, we are going to learn and use the Python programming language for all of our coding assignments. According to [IEEE Spectrum's 2022 ranking](https://spectrum.ieee.org/top-programming-languages-2022), Python is the top programming language, and according to a [KDNuggets article from 2020](https://www.kdnuggets.com/2020/01/python-preferred-languages-data-science.html), Python is the most popular programming language for analytics, data mining, and data science (followed by R).

### Why Use Programming for Data Science?
* Faster than analyzing by hand (especially for large data)!
* Reuse the same code on other data, or on the same data with different parameters/settings
* Enables a form of "reproducibility" (and ideally, transparency)
    * Can repeat "experiment" and get the same result
    * No "magic" steps
* Still important, however, to write down steps (log)
    * ideally, someone should be able to take your data, program, and description of steps, rerun everything, and get the same results!
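
Reproducibility often comes down to small details, such as fixing the seed of any random number generator your analysis uses. A minimal sketch, assuming a hypothetical `sample_mean` "experiment" that draws a random sample from a stand-in dataset:

```python
import random

def sample_mean(data, k, seed):
    """Draw k values at random (without replacement) and return their mean.

    A local random.Random(seed) makes the draw repeatable: the same data,
    k, and seed produce the same sample when rerun in the same environment.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, k)
    return sum(sample) / k

data = list(range(1, 101))  # stand-in dataset: the integers 1..100

first_run = sample_mean(data, k=10, seed=0)
second_run = sample_mean(data, k=10, seed=0)

print(first_run == second_run)  # True: same inputs and seed, same result
```

Recording the seed in your log, alongside the data and the steps, is one way to avoid "magic" steps that someone else cannot repeat.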

### Why Use Python for Data Science?
Advantages of learning Python include:
1. Easy to learn
1. Free, open source
1. Support for the life cycle of software (prototyping, development, testing, release, maintenance)
1. Many available libraries, especially for data analytics:
    1. [numpy](http://www.numpy.org/)
    1. [scipy](https://www.scipy.org/)
    1. [SciKits](https://scikits.appspot.com/) (especially [scikit-learn](http://scikit-learn.org/stable/) for machine learning)
    1. [pandas](http://pandas.pydata.org/)
    1. [Plotting libraries](https://wiki.python.org/moin/NumericAndScientific/Plotting), such as [matplotlib](http://matplotlib.org/) and [Plotly](https://plot.ly/)
1. Many supported GUI backends
1. LOTS of community support/development online
1. Cross platform support
    * Python is an interpreted language, so the same code can run on any system with a Python interpreter installed. The trade-off is speed: Python code can be slow to run compared with compiled languages like C
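
To get a rough feel for the interpreter overhead mentioned in the last point, the sketch below times an explicit Python loop against the built-in `sum`, which is implemented in C in CPython (the standard interpreter). The function name `python_loop_sum` is made up for this example, and the exact timings depend on your machine and Python version, so no expected numbers are shown.

```python
import timeit

def python_loop_sum(values):
    """Sum the values with an explicit loop executed by the interpreter."""
    total = 0
    for v in values:
        total += v
    return total

values = list(range(100_000))

# Both approaches compute the same answer...
assert python_loop_sum(values) == sum(values)

# ...but they typically take different amounts of time.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=20)
builtin_time = timeit.timeit(lambda: sum(values), number=20)

print(f"explicit loop: {loop_time:.4f} s")
print(f"built-in sum:  {builtin_time:.4f} s")
```

This is part of why libraries such as numpy matter for data science: they push the heavy loops into compiled code while you keep writing Python.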
    
### Python Distribution and IDE
We will use the [Anaconda Python 3 distribution](https://hub.docker.com/r/continuumio/anaconda3/), running inside a container managed by a containerization service called [Docker](https://www.docker.com/). More on how to install Docker and use Anaconda to write Python 3 code in the next lesson.

While you can use any text editor or IDE to write Python code, I encourage you to use Visual Studio (VS) Code. More on how to install VS Code and use it in the next lesson.

## Why Learn Data Science Algorithms?
If there are great libraries out there, why does this class exist? Why not just use those libraries? Why do we have to implement some data science algorithms ourselves?
* If you don't know how an algorithm works, you can only treat it as a "black box"
![](https://miro.medium.com/max/1400/1*1wy_l-q16tmNbVkLOaCwOQ.png)
    * (image from https://miro.medium.com/max/1400/1*1wy_l-q16tmNbVkLOaCwOQ.png)
* What happens when something goes wrong?
* How do you explain how your model works?
* How do you improve your model?