# Reflective Writing for Data Science Career Path - Codecademy
by Charalampos Spanias - January 2021

## Content
1. Getting Started with Data Science
2. Python Fundamentals
3. [Data Acquisition](#Acquisition)
    1. [Introduction: Data Acquisition](#Intro)
    2. [Getting Started with Data Acquisition](#GettingStarted)
    3. [SQL](#SQL)
    4. [Web Scraping](#WebScraping)

<a name="Acquisition"></a>
# 3. Data Acquisition

*SQL began at IBM in the 1970s as **SEQUEL** (**Structured English QUEry Language**), a language for querying databases.*

<a name="Intro"></a>
# 3.1 Introduction: Data Acquisition
**Data acquisition** (sometimes loosely called **data mining**, although that term more often refers to discovering patterns in data) is the process of gathering data.

>Posing question(s) &rarr; **Data Acquisition** &rarr; Data Cleaning

**Questions to ask when acquiring data**:
1. What data is needed to achieve the business goal?
1. How much data is needed to produce valuable insight and modeling?
1. Where and how can this data be found?
1. What legal and privacy parameters should be considered?

<a name="GettingStarted"></a>
# 3.2 Getting Started with Data Acquisition
**Methods of Data Acquisition**:
1. [Public & Private Data](#PublicPrivate)
1. [Web Scraping](#Scraping)
1. [APIs](#APIs)
1. [Manual Data Acquisition](#Manual)

<a name="PublicPrivate"></a>
## 3.2.1 Public & Private Data
1. [GitHub](https://github.com/)
1. [Kaggle](https://www.kaggle.com/)
1. [KDnuggets](https://www.kdnuggets.com/)
1. [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
1. [US Government’s Open Data](https://www.data.gov/)
1. [Five Thirty Eight](https://data.fivethirtyeight.com/)
1. [Amazon Web Services](https://aws.amazon.com/)

<a name="Scraping"></a>
## 3.2.2 Web Scraping
You can scrape data two ways: **manually and programmatically**. 

**Manually** just means literally copying and pasting it. 

**Programmatically** means **writing a “bot” or a “crawler”** that will systematically scan the page for data that fits the parameters you specify. For example, you may want everything that is in a Paragraph `<p>` tag. It is helpful to know a little bit of **HTML/CSS** to scrape a site, but not essential. 

Python has useful libraries that allow you to implement both methods of web scraping, such as **BeautifulSoup**, **Selenium**, and **Scrapy**. 

It’s important to note that web scraping has **ethical and legal considerations**. A lot of companies prefer not to be scraped because it can lead to misuse of their data and a slowed user experience. Some companies create **barriers to scraping** like including **CAPTCHA tests, timeouts, or hiding the content**. 
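As a minimal sketch of the programmatic approach, the snippet below collects the text of every `<p>` tag from an HTML string. It uses only the standard library's `html.parser` module rather than the libraries named above, and the `ParagraphCollector` class and the sample HTML are illustrative assumptions, not part of the course material.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph
        if self._in_paragraph:
            self.paragraphs[-1] += data


# Made-up page content, used only for illustration
html = (
    "<html><body><h1>Title</h1>"
    "<p>First paragraph.</p>"
    "<p>Second <b>bold</b> one.</p>"
    "</body></html>"
)

collector = ParagraphCollector()
collector.feed(html)
print(collector.paragraphs)  # ['First paragraph.', 'Second bold one.']
```

In practice the HTML would come from a downloaded page, and a library such as BeautifulSoup makes the same extraction much shorter.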

<a name="APIs"></a>
## 3.2.3 APIs
**Application Programming Interfaces (APIs)** are another method that can be used for data acquisition. 

APIs are built around the **HTTP Request/Response Cycle**. A client (you) sends a **request to a website’s server** for data through an **API call**, then the **server** searches its database and **responds** to the client with either the data or an error stating that the request cannot be fulfilled.

Many APIs require users to sign up and obtain an **API key**, which uniquely identifies them and is used to keep a record of all of that user’s requests.

![apis.PNG](attachment:apis.PNG)
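The request side of this cycle might be sketched as follows with the standard library's `urllib`. The endpoint `https://api.example.com/v1/measurements`, the query parameters, and the bearer-token header are hypothetical; a real API documents its own URL and how it expects the key to be sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://api.example.com/v1/measurements"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"  # placeholder for the key obtained at sign-up


def build_request(city, limit=10):
    """Build the request: URL, query parameters, and the identifying key."""
    query = urlencode({"city": city, "limit": limit})
    return Request(
        f"{API_URL}?{query}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )


def fetch(request):
    """Send the request and decode a JSON response (makes a network call)."""
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("Athens")
print(request.full_url)
# https://api.example.com/v1/measurements?city=Athens&limit=10
```

`fetch()` is not called here because the endpoint does not exist. Against a real API, `urlopen` raises `urllib.error.HTTPError` when the server responds with an error status.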

<a name="Manual"></a>
## 3.2.4 Manual Data Acquisition

When data is not available anywhere, you need to **collect it yourself**.

**Google Forms** is a simple and free way to create **surveys** that can be shared with others to acquire data related to a particular population. Google also offers **Google Surveys**, a paid service that allows you to have a wider range of respondents and gives you more control in determining your target audience.

Devices like **Nvidia’s Jetson Nano** and **Arduino’s Uno boards** are great for **acquiring data from your local environment**. With developer kits like these, you can create **sensor systems that can harvest data**, or **run machine learning models that can acquire even more complex data about the environment**. 

When working on **novel projects**, you will most often need to acquire the necessary data yourself, so these are useful tools to be familiar with.

<a name="SQL"></a>
# 3.3 SQL
1. [Relational Databases for Data Science](#RDB)
2. [Manipulation](#Manipulation)
3. [Queries](#Queries)
4. [Python with Databases](#Python)
5. [Thinking in SQL vs Thinking in Python](https://mode.com/blog/learning-python-sql/) &larr; *Article Link*
6. [Programming with Databases](https://swcarpentry.github.io/sql-novice-survey/10-prog/index.html) &larr; *Tutorial Link*

<a name="RDB"></a>
## 3.3.1 Relational Databases for Data Science
1. [Structured vs. Unstructured Data](#StrUns)
2. [Databases](#DBs)
3. [Relational Database Management Systems](#RDBMS)
4. [SQLite Data Types](#DataTypes)

<a name="StrUns"></a>
### 3.3.1.1 Structured vs. Unstructured Data
**Relational databases** store **structured data**, and we interact with them using **Structured Query Language (SQL)**. Structured data follows a data model, a kind of blueprint that defines what is in each column (or variable) and row (or observation); **pandas DataFrames** and **MySQL tables** are examples. The defined structure allows the data to be **easily indexed for efficient reference**.

**Unstructured data** (or **semi-structured** data) is basically anything else, such as **CSV files**, **Word documents**, and even **NoSQL databases**.

<a name="DBs"></a>
### 3.3.1.2 Databases
**Databases** are **collections of data that are organized for efficient accessibility and management**. Within the data science pipeline, they serve as the **storage units**.

There are multiple database models that specify how a database is structured. The **flat model** is the simplest and is essentially a single table. The **relational model** has multiple tables, each describing a particular entity of the database.

**Relational databases** (RBs) are **the primary means of storage for structured data** and they organize data into tables that each contain data related to one another.

When we consider RBs as a collection of tables, what we call a **schema**, we can visualize them with **entity-relationship diagrams**, which give us a chance to view the data within each table, and how each table relates to the others. The shared fields among the tables are called **keys**.

![schema.PNG](attachment:schema.PNG)
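A small sketch of how a key links two tables, using an in-memory SQLite database. The `majors` and `students` tables and their rows are made up for illustration, and the `JOIN` is used here only to show the shared field at work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing is written to disk

# Two tables that share a field: students.major_code points at majors.code
conn.executescript("""
CREATE TABLE majors (
    code INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE students (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    major_code INTEGER REFERENCES majors(code)
);
INSERT INTO majors VALUES (10, 'Biology'), (21, 'Physics');
INSERT INTO students VALUES (1, 'Stacy', 10), (2, 'Mark', 21), (3, 'Angela', 21);
""")

# The key lets us look up each student's major title in the other table
rows = conn.execute("""
    SELECT students.name, majors.title
    FROM students
    JOIN majors ON students.major_code = majors.code
    ORDER BY students.id
""").fetchall()

print(rows)  # [('Stacy', 'Biology'), ('Mark', 'Physics'), ('Angela', 'Physics')]
conn.close()
```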

<a name="RDBMS"></a>
### 3.3.1.3 Relational Database Management Systems (RDBMS)

**RDBMS** are important for data science and analytics because they provide the functionality needed for creating, reading, updating, and deleting data within our database, often referred to as **CRUD**.

The language that data teams utilize most often in RDBMS to execute commands is **SQL**, pronounced as “S-Q-L” or “sequel”. SQL is one of the **most common and powerful** languages for querying databases. It is fast, secure, and able to return millions of data points in just a few lines. Some of the **most common RDBMS that use SQL** are:

1. [MySQL](https://www.mysql.com/) is a popular **free and open-source** SQL database. It is widely used for **web applications** because the MySQL Server is renowned for its **speed**, **reliability**, and **ease of use on multiple platforms**.

2. [PostgreSQL](https://www.postgresql.org/) is a popular **open-source** SQL database. PostgreSQL is **one of the oldest RDBMS**, with over 30 years of development, so it has an **extensive community** supporting it and is known for its **reliability** and **array of features**.

3. [Oracle DB](https://www.oracle.com/database/) is considered to be among the most popular of all RDBMS. Owned by Oracle Corporation and **closed source**, it is **the go-to RDBMS for corporations** as it is **able to scale** for and support their massive workloads effectively.

4. [SQL Server](https://www.microsoft.com/en-us/sql-server) is a **closed-source** RDBMS that is popular, especially among **corporations**. While Microsoft offers SQL Server **free** through its **SQL Server 2019 Express edition**, the **enterprise editions**, which are designed for **large-scale** applications and offer more **functionality**, become more expensive as your application scales.

5. [SQLite](https://www.sqlite.org/index.html) is another popular **open-source** SQL database. SQLite is designed to be **compact**, **efficient**, and **self-contained**. SQLite is able to store a complete database in a single cross-platform disk file so that it is **not necessary to connect databases to a server**. These characteristics are why SQLite is considered to be the **most used RDBMS**: it is found in most cell phones, computers, and several other devices used daily.

It is important to note that while **most RDBMS use SQL**, the **SQL data types can vary between RDBMS**.

<a name="DataTypes"></a>
### 3.3.1.4 SQLite Data Types
With unstructured data, you would be able to enter any data in any order. However, in a relational database, we are able to **restrict certain fields to a specific data type**. 

![datatypes.PNG](attachment:datatypes.PNG)
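To see which type SQLite actually stored for each field, one option is its built-in `typeof()` function. The `readings` table below is a hypothetical example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each column is declared with one of SQLite's data types
conn.execute("CREATE TABLE readings (id INTEGER, label TEXT, value REAL)")
conn.execute("INSERT INTO readings VALUES (1, 'sensor-a', 21.5)")

# typeof() reports the storage type of each stored value
stored_types = conn.execute(
    "SELECT typeof(id), typeof(label), typeof(value) FROM readings"
).fetchone()

print(stored_types)  # ('integer', 'text', 'real')
conn.close()
```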

<a name="Manipulation"></a>
## 3.3.2 Manipulation
1. [Create](#Create)
1. [Insert](#Insert)
1. [Select](#Select)
1. [Alter](#Alter)
1. [Update](#Update)
1. [Delete](#Delete)
1. [Constraints](#Constraints)

SQL operates through **simple, declarative statements**.

[List with SQL commands](https://www.codecademy.com/article/sql-commands).

<a name="Create"></a>

**CREATE TABLE celebs ( <br>
   id INTEGER, <br>
   name TEXT, <br>
   age INTEGER <br>
);**

`CREATE TABLE` a **clause**. <br>

`celebs` the **name of the table**. <br>

`(id INTEGER, name TEXT, age INTEGER)` a **list of parameters** defining each column, or attribute in the table and its data type.

<a name="Insert"></a>

**INSERT INTO celebs (id, name, age) <br>
VALUES (1, 'Justin Bieber', 22);**

`INSERT INTO` a **clause** that **adds the specified row or rows**.

`celebs` **table** the row is added to.

`(id, name, age)` **parameter** identifying the columns that data will be inserted into.

`VALUES` **clause** that indicates the data being inserted.

`(1, 'Justin Bieber', 22)` **parameter** identifying the values being inserted.

<a name="Select"></a>

**SELECT name <br>
FROM celebs;**

`SELECT` clause that **indicates that the statement is a query**.

`name` specifies the column to query data from.

`FROM celebs` specifies the name of the table to query data from.

**SELECT * <br>
FROM celebs;**

`*` is a special **wildcard character** that allows you to select every column in a table.

`SELECT` statements always **return a new table** called **the result set**.

<a name="Alter"></a>

**ALTER TABLE celebs <br> 
ADD COLUMN twitter_handle TEXT;**

`ALTER TABLE` **clause** that **lets you make the specified changes**.

`ADD COLUMN` **clause** that lets you **add a new column** to a table.

`NULL` is a special value in SQL that represents **missing or unknown data**.

<a name="Update"></a>

**UPDATE celebs <br>
SET twitter_handle = '@taylorswift13' <br>
WHERE id = 4;**

`UPDATE` clause that **edits** a row in the table.

`SET` clause that indicates the column to edit.

`WHERE` a clause that indicates which row(s) to update with the new column value.

<a name="Delete"></a>

**DELETE FROM celebs <br>
WHERE twitter_handle IS NULL;**

`DELETE FROM` clause that lets you **delete rows** from a table.

`WHERE` clause that lets you select **which rows** you want to delete.

`IS NULL` a condition in SQL that returns true when the value is NULL and false otherwise.
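The statements above can be tried end to end from Python with the standard `sqlite3` module. This is a sketch on an in-memory database; the second row (id 4 and its values) is invented so that the `UPDATE` has something to match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE: define the table and its columns
cur.execute("CREATE TABLE celebs (id INTEGER, name TEXT, age INTEGER)")

# INSERT: add rows (the second row is illustrative)
cur.executemany(
    "INSERT INTO celebs (id, name, age) VALUES (?, ?, ?)",
    [(1, "Justin Bieber", 22), (4, "Taylor Swift", 26)],
)

# ALTER: the new column is NULL for every existing row
cur.execute("ALTER TABLE celebs ADD COLUMN twitter_handle TEXT")

# UPDATE: fill in the new column for one row only
cur.execute("UPDATE celebs SET twitter_handle = '@taylorswift13' WHERE id = 4")

# DELETE: remove the rows that still have no handle
cur.execute("DELETE FROM celebs WHERE twitter_handle IS NULL")

remaining = cur.execute("SELECT * FROM celebs").fetchall()
print(remaining)  # [(4, 'Taylor Swift', 26, '@taylorswift13')]
conn.close()
```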

<a name="Constraints"></a>

**Constraints** that add information about how a column can be used are invoked after specifying the data type for a column. They can be used to tell the database to **reject inserted data that does not adhere to a certain restriction**.

**CREATE TABLE celebs ( <br>
   id INTEGER PRIMARY KEY,  <br>
   name TEXT UNIQUE, <br>
   date_of_birth TEXT NOT NULL, <br>
   date_of_death TEXT DEFAULT 'Not Applicable' <br>
);**


`PRIMARY KEY` columns can be used to **uniquely identify the row**.

`UNIQUE` columns have a **different value** for every row.

`NOT NULL` columns **must have a value**.

`DEFAULT` columns take an **additional argument** that will be the **assumed value for an inserted row** if the new row does not specify a value for that column.
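A short sketch of how these constraints behave in SQLite when driven from Python. The inserted rows are made up; the point is that a violating insert raises `sqlite3.IntegrityError` instead of being stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE celebs (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    date_of_birth TEXT NOT NULL,
    date_of_death TEXT DEFAULT 'Not Applicable'
)""")

# A valid row that leaves date_of_death out, so the DEFAULT is applied
conn.execute(
    "INSERT INTO celebs (id, name, date_of_birth) "
    "VALUES (1, 'Taylor Swift', '1989-12-13')"
)
default_value = conn.execute(
    "SELECT date_of_death FROM celebs WHERE id = 1"
).fetchone()[0]

# Two inserts that break a constraint: a duplicate name, then a missing birth date
bad_statements = [
    "INSERT INTO celebs (id, name, date_of_birth) VALUES (2, 'Taylor Swift', '1989-12-13')",
    "INSERT INTO celebs (id, name) VALUES (3, 'Justin Bieber')",
]
errors = []
for statement in bad_statements:
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as error:
        errors.append(str(error))

print(default_value)  # Not Applicable
print(errors)         # one message per rejected insert
conn.close()
```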

<a name="Queries"></a>
## 3.3.3 Queries
1. [Select](#Select)
    1. [As](#As)
    1. [Distinct](#Distinct)
1. [Where](#Where)
    1. [Like](#Like)
    1. [is Null](#isNull)
    1. [Between](#Between)
    1. [And](#And)
    1. [Or](#Or)
1. [Order By](#OrderBy)
1. [Limit](#Limit)
1. [Case](#Case)

<a name="Select"></a>

### SELECT
`SELECT column1, column2
FROM table_name;`

<a name="As"></a>
#### AS
`SELECT name AS 'Titles'
FROM movies;`

**`AS`** allows you to **rename** a column or table using an alias.

It’s **best practice** to surround your aliases with **single quotes**.

When using **`AS`**, the **columns are not being renamed in the table**, the aliases only appear in the result.

<a name="Distinct"></a>
#### DISTINCT

`SELECT DISTINCT tools
FROM inventory;`

**`DISTINCT`** is used to **return unique values**.
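A minimal sketch combining `DISTINCT` and `AS` on a made-up `inventory` table. It also checks that the alias affects only the result set, not the table itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (tools TEXT)")
conn.executemany(
    "INSERT INTO inventory VALUES (?)",
    [("hammer",), ("drill",), ("hammer",)],  # 'hammer' appears twice
)

# DISTINCT collapses the duplicate; AS renames the column in the result only
cursor = conn.execute(
    "SELECT DISTINCT tools AS 'Tool' FROM inventory ORDER BY tools"
)
unique_tools = cursor.fetchall()
print(unique_tools)  # [('drill',), ('hammer',)]

# The column name stored in the table is unchanged
table_columns = [
    column[0] for column in conn.execute("SELECT * FROM inventory").description
]
print(table_columns)  # ['tools']
conn.close()
```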

<a name="Where"></a>
### WHERE

`SELECT *
FROM movies
WHERE imdb_rating > 8;`

**`WHERE` filters** the result set to only include rows where the following **condition is true**.

*Similar to **if**, works with **comparison operators**.*

<a name="Like"></a>
#### LIKE

`SELECT *
FROM movies
WHERE name LIKE 'Se_en';`

**`LIKE`** a **special operator** used with **`WHERE`** to search for a **specific pattern** in a column.

The **`_`** (**wildcard**) means you can substitute any individual character here without breaking the pattern.
***
`SELECT *
FROM movies
WHERE name LIKE '%man%';`

**`%`** is a **wildcard character** that matches **zero or more missing characters in the pattern**.

Here, any movie that contains the word ‘man’ in its name will be returned in the result.

**`LIKE`** is **not case sensitive**. ‘Batman’ and ‘Man of Steel’ will both appear in the result of the query above.
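Both wildcards can be tried from Python as sketched below. The table holds only movie names, and `names_matching` is a hypothetical helper. The case-insensitive matching shown is SQLite's default behavior for ASCII letters; other databases may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("Se7en",), ("Seven",), ("Batman",), ("Man of Steel",), ("Jaws",)],
)


def names_matching(pattern):
    """Return the movie names that match a LIKE pattern, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM movies WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [row[0] for row in rows]


single_character = names_matching("Se_en")  # _ stands for exactly one character
contains_man = names_matching("%man%")      # % stands for zero or more characters

print(single_character)  # ['Se7en', 'Seven']
print(contains_man)      # ['Batman', 'Man of Steel']
```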

<a name="isNull"></a>
#### IS NULL

`SELECT name
FROM movies
WHERE imdb_rating IS NOT NULL;`

Unknown values are indicated by **`NULL`**. It is not possible to test for **`NULL`** values with comparison operators.

Instead, we will have to use these operators:
* **`IS NULL`**
* **`IS NOT NULL`**
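The difference can be checked with a small sketch. The two rows are invented, and one of them has no rating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Jaws", 8.0), ("Unrated Film", None)],  # Python's None is stored as NULL
)

# A comparison with NULL is never true, so this matches no rows at all
with_equals = conn.execute(
    "SELECT name FROM movies WHERE imdb_rating = NULL"
).fetchall()

# IS NULL and IS NOT NULL are the operators that test for missing values
missing = conn.execute(
    "SELECT name FROM movies WHERE imdb_rating IS NULL"
).fetchall()
present = conn.execute(
    "SELECT name FROM movies WHERE imdb_rating IS NOT NULL"
).fetchall()

print(with_equals)  # []
print(missing)      # [('Unrated Film',)]
print(present)      # [('Jaws',)]
```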




<a name="Between"></a>
#### BETWEEN

`SELECT *
FROM movies
WHERE year BETWEEN 1990 AND 1999;`

**`BETWEEN`** is used with **`WHERE`** to **filter the result set within a certain range**. It accepts two values that are either **numbers, text or dates**.

For example, the above statement filters the result set to only include movies with years from 1990 up to, and **including 1999**.

`SELECT *
FROM movies
WHERE name BETWEEN 'A' AND 'J';`

When the values are **text**, `BETWEEN` filters the result set to the given **alphabetical range**.

In the above statement, it will include movies with names that begin with the letter ‘A’ up to, but **not including ones that begin with ‘J’**. However, if a movie has a name of simply **‘J’**, it would actually match. This is because BETWEEN goes up to the second value — up to ‘J’. So the movie named ‘J’ would be included in the result set but not ‘Jaws’.
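The inclusive bounds and the ‘J’ versus ‘Jaws’ behavior can be verified with a sketch like the one below. The rows are illustrative, and the movie named ‘J’ is invented purely to test the upper bound.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("Avatar", 2009),
        ("Goodfellas", 1990),
        ("J", 1995),  # made-up title that equals the upper bound
        ("Jaws", 1975),
        ("Toy Story", 1999),
    ],
)

# Numeric range: both 1990 and 1999 are included
nineties = [row[0] for row in conn.execute(
    "SELECT name FROM movies WHERE year BETWEEN 1990 AND 1999 ORDER BY year"
)]

# Text range: 'J' itself matches, but 'Jaws' sorts after 'J' and is excluded
a_to_j = [row[0] for row in conn.execute(
    "SELECT name FROM movies WHERE name BETWEEN 'A' AND 'J' ORDER BY name"
)]

print(nineties)  # ['Goodfellas', 'J', 'Toy Story']
print(a_to_j)    # ['Avatar', 'Goodfellas', 'J']
```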

<a name="And"></a>
#### AND

`SELECT * 
FROM movies
WHERE year BETWEEN 1990 AND 1999
   AND genre = 'romance';`

With **`AND`**, **both conditions must be true** for the row to be included in the result.

<a name="Or"></a>
#### OR

`SELECT *
FROM movies
WHERE year > 2014
   OR genre = 'action';`

**`OR`** displays a row **if any condition is true**.
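A sketch of `AND` and `OR` on a handful of rows. The years and genre labels are illustrative values rather than data from the course.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER, genre TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Titanic", 1997, "romance"),
        ("The Matrix", 1999, "action"),
        ("La La Land", 2016, "romance"),
        ("Jaws", 1975, "thriller"),
    ],
)

# AND: a row must satisfy both conditions
nineties_romance = [row[0] for row in conn.execute(
    """SELECT name FROM movies
       WHERE year BETWEEN 1990 AND 1999
         AND genre = 'romance'"""
)]

# OR: a row needs to satisfy only one of the conditions
recent_or_action = [row[0] for row in conn.execute(
    """SELECT name FROM movies
       WHERE year > 2014
          OR genre = 'action'
       ORDER BY name"""
)]

print(nineties_romance)  # ['Titanic']
print(recent_or_action)  # ['La La Land', 'The Matrix']
```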

<a name="OrderBy"></a>
### ORDER BY

`SELECT *
FROM movies
ORDER BY name;`

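**`ORDER BY`** **sorts** results, either alphabetically or numerically.

For example, the statement above sorts everything by the movie’s title from A through Z. To sort by year instead, newest first:

`SELECT *
FROM movies
WHERE imdb_rating > 8
ORDER BY year DESC;`

**`DESC`** sorts from high to low (or Z to A), while **`ASC`**, the default, sorts from low to high (or A to Z).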

The column that we **`ORDER BY`** doesn’t even have to be one of the columns that we’re displaying.

**Note**: **`ORDER BY`** always goes after **`WHERE`** (if **`WHERE`** is present).

<a name="Limit"></a>
### LIMIT

`SELECT *
FROM movies
LIMIT 10;`

**`LIMIT`** lets you specify the **maximum number of rows the result set** will have.

It always goes at the **very end of the query**. Also, it is **not supported in all SQL databases**.
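`WHERE`, `ORDER BY`, and `LIMIT` are often combined to get a "top N" result. A sketch with made-up movie names and ratings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, year INTEGER, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [  # hypothetical titles and ratings
        ("Alpha", 2001, 7.1),
        ("Bravo", 1995, 8.4),
        ("Charlie", 2010, 8.8),
        ("Delta", 1988, 6.2),
    ],
)

# The two highest-rated movies; the sort column is not one of the displayed columns
top_two = [row[0] for row in conn.execute(
    "SELECT name FROM movies ORDER BY imdb_rating DESC LIMIT 2"
)]

# Clause order: WHERE, then ORDER BY, then LIMIT at the very end
newest_highly_rated = [row[0] for row in conn.execute(
    """SELECT name FROM movies
       WHERE imdb_rating > 8
       ORDER BY year DESC
       LIMIT 1"""
)]

print(top_two)              # ['Charlie', 'Bravo']
print(newest_highly_rated)  # ['Charlie']
```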

<a name="Case"></a>
### CASE

`SELECT name,
 CASE
  WHEN imdb_rating > 8 THEN 'Fantastic'
  WHEN imdb_rating > 6 THEN 'Poorly Received'
  ELSE 'Avoid at All Costs'
 END
FROM movies;`

Usually used with **`SELECT`**. It is SQL’s **if-then** logic.

The **`CASE`** statement must end with **`END`**.

In the result, you have to scroll right because the **column name is very long**. To shorten it, we can rename the column to ‘Review’ using **`AS`**:

`SELECT name,
 CASE
  WHEN imdb_rating > 8 THEN 'Fantastic'
  WHEN imdb_rating > 6 THEN 'Poorly Received'
  ELSE 'Avoid at All Costs'
 END AS 'Review'
FROM movies;`
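The `CASE` logic can be checked against a few sample ratings, as in this sketch. The movie names and ratings are hypothetical; the thresholds and labels are the ones used above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (name TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("Alpha", 9.0), ("Bravo", 7.0), ("Charlie", 5.0)],  # made-up data
)

# The WHEN branches are tested in order; the first one that is true wins
reviews = conn.execute(
    """SELECT name,
        CASE
         WHEN imdb_rating > 8 THEN 'Fantastic'
         WHEN imdb_rating > 6 THEN 'Poorly Received'
         ELSE 'Avoid at All Costs'
        END AS 'Review'
       FROM movies
       ORDER BY name"""
).fetchall()

for name, review in reviews:
    print(name, "->", review)
```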

[More SQL commands](https://www.codecademy.com/paths/data-science/tracks/dscp-data-acquisition/modules/dscp-sql/articles/sql-commands).

<a name='Python'></a>
## 3.3.4 Python with Databases

Python’s **Database API** (DB-API) 2.0 is a specification implemented by the driver modules that connect Python to RDBMS such as PostgreSQL (psycopg2), MySQL (mysqlclient), Oracle (pyodbc), and SQLite (**sqlite3**).

The [**sqlite3**](https://docs.python.org/3/library/sqlite3.html) module allows us to **manipulate data in SQLite DBs** from **within our Python** script.


1. [Connecting to SQLite in Python](#Connecting)
2. [Create cursor object](#Cursor)
3. [Executing SQL Statements in Python](#Executing)
4. [Reading our SQL data with Python](#Reading)
5. [SQLite with Pandas](#Pandas)

<a name="Connecting"></a>
### 3.3.4.1 Connecting to SQLite in Python
**`import sqlite3`** <br>

An **Application Programming Interface** (API) is simply a way for different applications to communicate with each other.

**`conn = sqlite3.connect("first.db")`**

This call will **either connect** to the database named, **or create** that database if it does not already exist.
We can imagine our connection object as a cable that connects our python environment to our SQLite database.

In [1]:
import sqlite3

# create database
conn = sqlite3.connect("first.db")

<a name="Cursor"></a>
### 3.3.4.2 Create cursor object
**`cursor = conn.cursor()`**

Next we need **a way to call SQL statements on the data** within the database. 

A **cursor object** represents a database cursor, and can be used to call statements to our SQLite database, and return the data in our python environment.

In [2]:
# create cursor object
cursor = conn.cursor()

<a name="Executing"></a>
### 3.3.4.3 Executing SQL Statements in Python
**`cursor.execute()`**
                  
**Note**: We are using a **triple-quoted String** to make a multi-line String. 

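In [3]:
# create the table (IF NOT EXISTS avoids an error when this cell is re-run)
cursor.execute('''CREATE TABLE IF NOT EXISTS students (
                    id INTEGER PRIMARY KEY,
                    name TEXT NOT NULL,
                    email TEXT NOT NULL UNIQUE,
                    major_code INTEGER,
                    grad_date datetime,
                    grade TEXT NOT NULL)''')

# insert a new row (OR IGNORE skips it if a row with id 101 already exists)
cursor.execute('''INSERT OR IGNORE INTO students VALUES (101, 'Alex', 'alex@codeu.com', 32, '2022-05-16', 'Pass')''')

**Note**: With a plain `CREATE TABLE`, re-running this cell raises `OperationalError: table students already exists`. The `grade` column is declared as `TEXT` because it stores values such as 'Pass' and 'Fail'.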

**`executemany()`**

To insert **multiple rows at once** we can use the **`executemany()`** method, a variation of `execute()` that runs the **same statement once for each set of parameters** in a sequence.

In [None]:
students = [(102, 'Joe', 'joseph@codeu.com', 32, '2022-05-16', 'Pass'),
            (103, 'Stacy', 'stacy@codeu.com', 10, '2022-05-16', 'Pass'),
            (104, 'Angela', 'angela@codeu.com', 21, '2022-12-20', 'Pass'),
            (105, 'Mark', 'mark@codeu.com', 21, '2022-12-20', 'Fail'),
            (106, 'Nathan', 'nathaniel@codeu.com', 21, '2022-12-20', 'Pass')
            ]
 
cursor.executemany('''INSERT INTO students VALUES (?,?,?,?,?,?)''', students)

We use 6 **question marks as placeholders** to represent each of the 6 fields in the table that we will insert values into.

**`conn.commit()`**

Changes are not saved permanently until they are committed. Committing also ensures that the changes will be visible to others who may be working with our database.

In [None]:
# Commit changes to database
conn.commit()
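The effect of committing can be observed with two connections to the same file, as in the sketch below. It assumes the `sqlite3` module's default transaction handling, and the `demo.db` file, the reduced `students` table, and the `count_rows` helper are made up for the demonstration.

```python
import os
import sqlite3
import tempfile


def count_rows(connection):
    """Count the rows in the students table as seen by this connection."""
    return connection.execute("SELECT COUNT(*) FROM students").fetchall()[0][0]


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.db")  # throwaway database file
    writer = sqlite3.connect(path)
    reader = sqlite3.connect(path)  # stands in for another user of the database

    writer.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
    writer.commit()

    # The insert is pending inside the writer's open transaction
    writer.execute("INSERT INTO students VALUES (?, ?)", (101, "Alex"))
    before_commit = count_rows(reader)

    # After the commit, the second connection can see the new row
    writer.commit()
    after_commit = count_rows(reader)

    writer.close()
    reader.close()

print(before_commit, after_commit)  # 0 1
```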

<a name="Reading"></a>
### 3.3.4.4 Reading our SQL data with Python

To read the data within our database, we can use multiple methods. The simplest is a **for loop** that iterates over the rows returned by a SQL query.

In [None]:
for row in cursor.execute("SELECT * FROM students"):
    print(row)

There are a number of sqlite3 methods that will **retrieve data**:

**`fetchone()`**

When we simply want to **return the first row**.

In [None]:
# return first row
cursor.execute("SELECT * FROM students").fetchone()

**`fetchmany()`**

To **return a specific number of rows**.

In [None]:
# return the first 3 rows
cursor.execute("SELECT * FROM students").fetchmany(3)

**`fetchall()`**

If we want to **return all rows**.

In [None]:
# return all rows
cursor.execute("SELECT * FROM students").fetchall()

**Note**: A **for loop** yields one **tuple** per row and **`fetchone()`** returns a single **tuple**, while **`fetchmany()`** and **`fetchall()`** return **lists of tuples**.
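The return types of the fetch methods can be compared side by side on a small in-memory table. This is a sketch; the table has only two of the original columns to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
cursor.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [(101, "Alex"), (102, "Joe"), (103, "Stacy")],
)

query = "SELECT * FROM students ORDER BY id"

first = cursor.execute(query).fetchone()        # a single tuple
first_two = cursor.execute(query).fetchmany(2)  # a list with up to 2 tuples
everything = cursor.execute(query).fetchall()   # a list with every row

print(first)       # (101, 'Alex')
print(first_two)   # [(101, 'Alex'), (102, 'Joe')]
print(everything)  # [(101, 'Alex'), (102, 'Joe'), (103, 'Stacy')]
conn.close()
```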

We can use Python’s built-in functions **`sum()`** and **`len()`** on our result set to obtain the mean value of a field.

In [None]:
# create a list of one-element tuples holding the major codes
major_codes = cursor.execute("SELECT major_code FROM students;").fetchall()

# unpack each tuple, then average the values with sum() and len()
codes = [row[0] for row in major_codes]
average = sum(codes) / len(codes)
print(average)

<a name="Pandas"></a>
### 3.3.4.5 SQLite with Pandas
**`read_sql_query`**

Takes in a **query and a connection as parameters** and returns a DataFrame corresponding to the output of the query.

In [None]:
import pandas as pd

# create a new dataframe from the result set
df = pd.read_sql_query('''SELECT * from students;''', conn)
df

We can also **read a pandas DataFrame** and then **convert it to a SQLite database table**.

**`DataFrame.to_sql()`**

In [None]:
# use read_csv to read in data as a pandas dataframe
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
 
# show DataFrame
df

In [None]:
# instantiate a connection
connection = sqlite3.connect("titanic.db")
 
# instantiate a cursor
cursor = connection.cursor()
 
# create a table
df.to_sql("titanic", connection)

<a name="WebScraping"></a>
# 3.4 Web Scraping
1. [Rules of Scraping](#Rules)
2. [Requests](#Requests)
3. [The Beautiful Soup Object](#BSObject)
4. [Object Types](#ObjectTypes)
5. [Navigating by Tags](#Tags)
6. [Website Structure](#Structure)
7. [Find All](#FindAll)
8. [Select for CSS Selectors](#CSSS)
9. [Reading Text](#Text)
10. [Chocolate Project](#Project)

<a name="Rules"></a>
## Rules of Scraping

1. Check the **legal use** of the site's data.
2. Do not spam the site with requests (a common rule of thumb is at most 1 request per second).
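One way to respect the second rule is to pause between requests. The generator below is a sketch of that idea; `throttled` is a hypothetical helper, the URLs are placeholders, and no request is actually sent. The demo uses a short interval so that it finishes quickly, whereas a real scraper would use about one second.

```python
import time


def throttled(items, min_interval=1.0):
    """Yield items no faster than one per `min_interval` seconds."""
    last_time = None
    for item in items:
        if last_time is not None:
            # Wait out whatever is left of the interval since the last item
            remaining = min_interval - (time.monotonic() - last_time)
            if remaining > 0:
                time.sleep(remaining)
        last_time = time.monotonic()
        yield item


# Placeholder URLs; a real scraper would download each page inside the loop
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

start = time.monotonic()
visited = []
for url in throttled(urls, min_interval=0.05):
    visited.append(url)  # the page download would go here
elapsed = time.monotonic() - start

print(visited == urls, round(elapsed, 2))
```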

<a name="Requests"></a>
## Requests

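`requests` library &rarr; sends an HTTP request to a URL and returns the response (the page’s HTML is available through `.content` or `.text`)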

<a name="BSObject"></a>
## The Beautiful Soup Object

Pull out the HTML parts of the page that we need.

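`soup = BeautifulSoup(webpage_html, "html.parser")`

Here `webpage_html` stands for the HTML itself (a string or an open file, for example the `.content` of a `requests` response), not a file name.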

<a name="ObjectTypes"></a>
## Object Types

BS breaks down the HTML page into several types of objects.

**Tag** &rarr; HTML tag

<a name="Tags"></a>
## Navigating by Tags

<a name="Structure"></a>
## Website Structure
We need to know the website structure and what we are looking for.

Most browsers have [**Dev Tools**](https://www.codecademy.com/article/use-devtools) &rarr; **inspect** the website (see its HTML elements).

**HTML Inspection** &rarr; **Locate the required info**

<a name="FindAll"></a>
## Find All
`find_all()` &rarr; all the occurrences of a tag

It is very **flexible** &rarr; it can take **regexes**, **attributes**, and **functions**!

<a name="CSSS"></a>
## Select for CSS Selectors
`select()` &rarr; takes a **CSS selector** string (the same syntax used in a `.css` file) and returns all the matching elements

<a name="Text"></a>
## Reading Text
`get_text()` &rarr; retrieve the text inside of a tag
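The three methods above can be tried together on a small HTML string, as in this sketch. It assumes the `beautifulsoup4` package is installed, and the page content is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Made-up page content standing in for a downloaded page
html = """
<html><body>
  <h1 id="title">Chocolate Ratings</h1>
  <h2>Top bars</h2>
  <p class="bar">Dark 70%</p>
  <p class="bar">Milk 40%</p>
  <p class="note">Ratings are illustrative.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() with a tag name: every <p> element
paragraphs = [p.get_text() for p in soup.find_all("p")]

# find_all() with a regex: every heading tag from <h1> to <h6>
headings = [h.get_text() for h in soup.find_all(re.compile(r"^h[1-6]$"))]

# select() with CSS selectors: by class and by id
bars = [tag.get_text() for tag in soup.select("p.bar")]
title = soup.select_one("#title").get_text()

print(paragraphs)  # ['Dark 70%', 'Milk 40%', 'Ratings are illustrative.']
print(headings)    # ['Chocolate Ratings', 'Top bars']
print(bars)        # ['Dark 70%', 'Milk 40%']
print(title)       # Chocolate Ratings
```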