
# SQL Fundamentals

Welcome to the SQL Fundamentals Notebook! This notebook covers the basics of SQL: selecting data, computing aggregates, joining tables, and data modeling. We'll use SQLite for demonstration purposes. SQLite is a lightweight database engine that stores the whole database in a single file on disk and doesn't require a separate server process.

## Setup: Installing and Importing Required Libraries

First, we'll install and import the necessary libraries. We'll use `sqlite3` (part of the Python standard library, so it needs no installation) for database operations and `pandas` for data manipulation and display.

In [1]:
# Install pandas if not already installed
!pip install pandas



In [2]:
import sqlite3
import pandas as pd
from IPython.display import display

## Creating a Sample Database

Let's create a sample database with two tables: `Students` and `Courses`. The `Students` table will store student information, and the `Courses` table will store course information.

In [3]:
# Connect to the SQLite database file (it is created if it doesn't exist)
conn = sqlite3.connect('school.db')
cursor = conn.cursor()

# Create Students table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Students (
    StudentID INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName TEXT NOT NULL,
    Age INTEGER,
    Major TEXT
);
''')

# Create Courses table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Courses (
    CourseID INTEGER PRIMARY KEY,
    CourseName TEXT NOT NULL,
    Credits INTEGER
);
''')

conn.commit()
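
To check that a table was created with the intended columns, SQLite's `PRAGMA table_info` can be queried. The following is a minimal sketch that uses a separate in-memory database (the name `demo_conn` is an assumption made for this example) so that it does not touch `school.db`.

```python
import sqlite3

# In-memory database so this sketch does not modify school.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
CREATE TABLE IF NOT EXISTS Students (
    StudentID INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName TEXT NOT NULL,
    Age INTEGER,
    Major TEXT
);
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema_rows = demo_conn.execute("PRAGMA table_info(Students);").fetchall()
column_names = [row[1] for row in schema_rows]
not_null_columns = [row[1] for row in schema_rows if row[3] == 1]

print(column_names)
print(not_null_columns)
demo_conn.close()
```

The same `PRAGMA` statement can be run against `conn` in this notebook to inspect the real tables.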

### Inserting Sample Data

We'll insert some sample data into both tables to work with.

In [4]:
# Insert sample data into Students table
students = [
    (1, 'John', 'Doe', 20, 'Computer Science'),
    (2, 'Jane', 'Smith', 22, 'Mathematics'),
    (3, 'Alice', 'Johnson', 19, 'Physics'),
    (4, 'Bob', 'Lee', 21, 'Chemistry'),
    (5, 'Charlie', 'Brown', 23, 'Biology')
]

cursor.executemany('INSERT OR IGNORE INTO Students VALUES (?, ?, ?, ?, ?);', students)

# Insert sample data into Courses table
courses = [
    (101, 'Introduction to Programming', 3),
    (102, 'Calculus I', 4),
    (103, 'General Physics', 4),
    (104, 'Organic Chemistry', 3),
    (105, 'Biology 101', 3)
]

cursor.executemany('INSERT OR IGNORE INTO Courses VALUES (?, ?, ?);', courses)

conn.commit()
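
The `INSERT OR IGNORE` form used above matters when a notebook cell is re-run: rows whose primary key already exists are skipped instead of raising an error. Here is a small sketch of that behavior on an in-memory database (`demo_conn` is a name assumed for this example), contrasted with a plain `INSERT`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Courses (CourseID INTEGER PRIMARY KEY, CourseName TEXT NOT NULL, Credits INTEGER);"
)

courses = [
    (101, 'Introduction to Programming', 3),
    (102, 'Calculus I', 4),
]

# Run the same insert twice, as happens when a cell is executed again
for _ in range(2):
    demo_conn.executemany("INSERT OR IGNORE INTO Courses VALUES (?, ?, ?);", courses)
demo_conn.commit()

row_count = demo_conn.execute("SELECT COUNT(*) FROM Courses;").fetchone()[0]
print(row_count)

# A plain INSERT with a duplicate primary key raises instead of skipping the row
try:
    demo_conn.execute("INSERT INTO Courses VALUES (?, ?, ?);", courses[0])
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
demo_conn.close()
```

The `?` placeholders also keep the values separate from the SQL text, which avoids quoting problems and SQL injection.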

## 1. Selecting Data

The `SELECT` statement is used to retrieve data from a database. Let's explore how to select specific columns, filter results, and sort data.

### Selecting All Columns

To select all columns from the `Students` table:

In [5]:
query = 'SELECT * FROM Students;'
df = pd.read_sql_query(query, conn)
display(df)

|   | StudentID | FirstName | LastName | Age | Major |
|---|---|---|---|---|---|
| 0 | 1 | John | Doe | 20 | Computer Science |
| 1 | 2 | Jane | Smith | 22 | Mathematics |
| 2 | 3 | Alice | Johnson | 19 | Physics |
| 3 | 4 | Bob | Lee | 21 | Chemistry |
| 4 | 5 | Charlie | Brown | 23 | Biology |


### Selecting Specific Columns

To select specific columns, specify them after `SELECT`.

In [6]:
query = 'SELECT FirstName, LastName, Major FROM Students;'
df = pd.read_sql_query(query, conn)
display(df)

|   | FirstName | LastName | Major |
|---|---|---|---|
| 0 | John | Doe | Computer Science |
| 1 | Jane | Smith | Mathematics |
| 2 | Alice | Johnson | Physics |
| 3 | Bob | Lee | Chemistry |
| 4 | Charlie | Brown | Biology |


### Filtering Results with WHERE

Use the `WHERE` clause to filter results based on a condition.

In [7]:
query = "SELECT * FROM Students WHERE Age > 21;"
df = pd.read_sql_query(query, conn)
display(df)

|   | StudentID | FirstName | LastName | Age | Major |
|---|---|---|---|---|---|
| 0 | 2 | Jane | Smith | 22 | Mathematics |
| 1 | 5 | Charlie | Brown | 23 | Biology |
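
Conditions in a `WHERE` clause can be combined with `AND` / `OR`, and the comparison values can be passed as parameters rather than written into the SQL string. The sketch below assumes an in-memory copy of the sample data (`demo_conn`, `min_age`, and `excluded_major` are names chosen for this example) and uses plain `sqlite3` so that it runs without pandas.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT, "
    "LastName TEXT, Age INTEGER, Major TEXT);"
)
demo_conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?);", [
    (1, 'John', 'Doe', 20, 'Computer Science'),
    (2, 'Jane', 'Smith', 22, 'Mathematics'),
    (3, 'Alice', 'Johnson', 19, 'Physics'),
    (4, 'Bob', 'Lee', 21, 'Chemistry'),
    (5, 'Charlie', 'Brown', 23, 'Biology'),
])

# Both conditions must hold: at least min_age AND not in the excluded major
min_age = 20
excluded_major = 'Biology'
rows = demo_conn.execute(
    "SELECT FirstName, Age FROM Students "
    "WHERE Age >= ? AND Major <> ? ORDER BY StudentID;",
    (min_age, excluded_major),
).fetchall()

print(rows)
demo_conn.close()
```

With pandas, the same values can be supplied through the `params` argument of `pd.read_sql_query`.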


### Sorting Results with ORDER BY

Use the `ORDER BY` clause to sort the results.

In [8]:
query = "SELECT * FROM Students ORDER BY Age ASC;"
df = pd.read_sql_query(query, conn)
display(df)

|   | StudentID | FirstName | LastName | Age | Major |
|---|---|---|---|---|---|
| 0 | 3 | Alice | Johnson | 19 | Physics |
| 1 | 1 | John | Doe | 20 | Computer Science |
| 2 | 4 | Bob | Lee | 21 | Chemistry |
| 3 | 2 | Jane | Smith | 22 | Mathematics |
| 4 | 5 | Charlie | Brown | 23 | Biology |
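
`ORDER BY` also accepts `DESC` for descending order, and it is often paired with `LIMIT` to keep only the first few rows. A minimal sketch, assuming a trimmed in-memory copy of the `Students` table (`demo_conn` is a name used only in this example):

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT, Age INTEGER);"
)
demo_conn.executemany("INSERT INTO Students VALUES (?, ?, ?);", [
    (1, 'John', 20),
    (2, 'Jane', 22),
    (3, 'Alice', 19),
    (4, 'Bob', 21),
    (5, 'Charlie', 23),
])

# Sort from oldest to youngest and keep only the first two rows
oldest_two = demo_conn.execute(
    "SELECT FirstName, Age FROM Students ORDER BY Age DESC LIMIT 2;"
).fetchall()

print(oldest_two)
demo_conn.close()
```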


## 2. Performing Aggregates

Aggregate functions perform a calculation on a set of values and return a single value. Common aggregate functions include `COUNT`, `SUM`, `AVG`, `MIN`, and `MAX`.

### COUNT

Count the number of students in each major.

In [9]:
query = "SELECT Major, COUNT(*) AS NumStudents FROM Students GROUP BY Major;"
df = pd.read_sql_query(query, conn)
display(df)

|   | Major | NumStudents |
|---|---|---|
| 0 | Biology | 1 |
| 1 | Chemistry | 1 |
| 2 | Computer Science | 1 |
| 3 | Mathematics | 1 |
| 4 | Physics | 1 |
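
Grouped results can themselves be filtered with `HAVING`, which is applied after aggregation (whereas `WHERE` filters rows before grouping). Because every major in the sample data has exactly one student, the sketch below adds a hypothetical sixth student (`Dana White`, not part of the notebook's data) to an in-memory copy so that one group has more than one member.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT, Major TEXT);"
)
demo_conn.executemany("INSERT INTO Students VALUES (?, ?, ?);", [
    (1, 'John', 'Computer Science'),
    (2, 'Jane', 'Mathematics'),
    (3, 'Alice', 'Physics'),
    (4, 'Bob', 'Chemistry'),
    (5, 'Charlie', 'Biology'),
    (6, 'Dana', 'Physics'),  # hypothetical extra student, for illustration only
])

# Keep only the majors that have more than one student
popular_majors = demo_conn.execute(
    "SELECT Major, COUNT(*) AS NumStudents FROM Students "
    "GROUP BY Major HAVING COUNT(*) > 1;"
).fetchall()

print(popular_majors)
demo_conn.close()
```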


### AVG (Average)

Calculate the average age of students.

In [10]:
query = "SELECT AVG(Age) as AverageAge FROM Students;"
df = pd.read_sql_query(query, conn)
display(df)

|   | AverageAge |
|---|---|
| 0 | 21.0 |


### SUM, MIN, and MAX

For demonstration, let's add a `Credits` column to the `Students` table representing the total credits each student has earned.

In [11]:
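# Add Credits column to Students table (skip if a previous run already added it)
try:
    cursor.execute('ALTER TABLE Students ADD COLUMN Credits INTEGER DEFAULT 0;')
except sqlite3.OperationalError:
    pass  # Column already exists in school.db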

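# Credits earned, as (StudentID, Credits) pairs
student_credits = [
    (1, 15),
    (2, 18),
    (3, 12),
    (4, 16),
    (5, 14)
]

# The UPDATE placeholders expect (Credits, StudentID), so swap each pair
cursor.executemany('UPDATE Students SET Credits = ? WHERE StudentID = ?;', [(credit, sid) for sid, credit in student_credits])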

conn.commit()

Now, let's perform `SUM`, `MIN`, and `MAX` operations.

In [12]:
query = "SELECT SUM(Credits) as TotalCredits FROM Students;"
df = pd.read_sql_query(query, conn)
display(df)

|   | TotalCredits |
|---|---|
| 0 | 75 |


In [13]:
query = "SELECT MIN(Credits) as MinCredits FROM Students;"
df = pd.read_sql_query(query, conn)
display(df)

|   | MinCredits |
|---|---|
| 0 | 12 |


In [14]:
query = "SELECT MAX(Credits) as MaxCredits FROM Students;"
df = pd.read_sql_query(query, conn)
display(df)

|   | MaxCredits |
|---|---|
| 0 | 18 |
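
Several aggregates can also be computed in a single query, which avoids scanning the table once per function. A minimal sketch, assuming a simplified in-memory table that holds only `StudentID` and `Credits` with the same values as above (`demo_conn` is a name used only here):

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, Credits INTEGER DEFAULT 0);"
)
demo_conn.executemany("INSERT INTO Students VALUES (?, ?);", [
    (1, 15), (2, 18), (3, 12), (4, 16), (5, 14),
])

# One pass over the table returns all four summary values in one row
total, minimum, maximum, average = demo_conn.execute(
    "SELECT SUM(Credits), MIN(Credits), MAX(Credits), AVG(Credits) FROM Students;"
).fetchone()

print(total, minimum, maximum, average)
demo_conn.close()
```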


## 3. Joins

Joins are used to combine rows from two or more tables based on a related column between them. We'll create a third table, `Enrollments`, to demonstrate joins.

### Creating the Enrollments Table

The `Enrollments` table will link students to the courses they're enrolled in.

In [15]:
# Create Enrollments table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Enrollments (
    EnrollmentID INTEGER PRIMARY KEY,
    StudentID INTEGER,
    CourseID INTEGER,
    Grade TEXT,
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
    FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);
''')



# Insert sample data into Enrollments table
enrollments = [
    (1, 1, 101, 'A'),
    (2, 1, 102, 'B+'),
    (3, 2, 103, 'A-'),
    (4, 3, 104, 'B'),
    (5, 4, 105, 'A'),
    (6, 5, 101, 'B+'),
    (7, 2, 105, 'A'),
    (8, 3, 102, 'B-')
]

cursor.executemany('INSERT OR IGNORE INTO Enrollments VALUES (?, ?, ?, ?);', enrollments)

conn.commit()
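
One caveat about the `FOREIGN KEY` clauses: SQLite does not enforce them unless foreign key support is switched on for the connection with `PRAGMA foreign_keys = ON`. The sketch below shows the difference on an in-memory database with trimmed-down tables. It assumes a SQLite build with foreign key support compiled in, which is the case for standard builds, and `demo_conn` is a name used only in this example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

# Enforcement is off by default; enable it before any transaction starts
demo_conn.execute("PRAGMA foreign_keys = ON;")
fk_enabled = demo_conn.execute("PRAGMA foreign_keys;").fetchone()[0]

demo_conn.execute(
    "CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT NOT NULL);"
)
demo_conn.execute("""
CREATE TABLE Enrollments (
    EnrollmentID INTEGER PRIMARY KEY,
    StudentID INTEGER,
    Grade TEXT,
    FOREIGN KEY (StudentID) REFERENCES Students(StudentID)
);
""")
demo_conn.execute("INSERT INTO Students VALUES (1, 'John');")

# StudentID 99 does not exist, so this enrollment should be rejected
try:
    demo_conn.execute("INSERT INTO Enrollments VALUES (1, 99, 'A');")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(fk_enabled, orphan_rejected)
demo_conn.close()
```

Without the `PRAGMA`, the same insert would succeed and leave an enrollment that points at a nonexistent student.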

### INNER JOIN

An `INNER JOIN` returns records that have matching values in both tables.

In [16]:
query = """
SELECT Students.StudentID, Students.FirstName, Students.LastName, Students.Major,
       Courses.CourseName, Enrollments.Grade
FROM Students
INNER JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
INNER JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
"""
df = pd.read_sql_query(query, conn)
display(df)

|   | StudentID | FirstName | LastName | Major | CourseName | Grade |
|---|---|---|---|---|---|---|
| 0 | 1 | John | Doe | Computer Science | Introduction to Programming | A |
| 1 | 1 | John | Doe | Computer Science | Calculus I | B+ |
| 2 | 2 | Jane | Smith | Mathematics | General Physics | A- |
| 3 | 3 | Alice | Johnson | Physics | Organic Chemistry | B |
| 4 | 4 | Bob | Lee | Chemistry | Biology 101 | A |
| 5 | 5 | Charlie | Brown | Biology | Introduction to Programming | B+ |
| 6 | 2 | Jane | Smith | Mathematics | Biology 101 | A |
| 7 | 3 | Alice | Johnson | Physics | Calculus I | B- |


### LEFT JOIN

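A `LEFT JOIN` returns all records from the left table and the matched records from the right table. If there is no match, the columns from the right table are `NULL`. In our sample data every student has at least one enrollment, so the query below returns the same rows as the `INNER JOIN`, only in a different order.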

In [17]:
query = """
SELECT Students.StudentID, Students.FirstName, Students.LastName, Students.Major,
       Courses.CourseName, Enrollments.Grade
FROM Students
LEFT JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
LEFT JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
"""
df = pd.read_sql_query(query, conn)
display(df)

|   | StudentID | FirstName | LastName | Major | CourseName | Grade |
|---|---|---|---|---|---|---|
| 0 | 1 | John | Doe | Computer Science | Introduction to Programming | A |
| 1 | 1 | John | Doe | Computer Science | Calculus I | B+ |
| 2 | 2 | Jane | Smith | Mathematics | General Physics | A- |
| 3 | 2 | Jane | Smith | Mathematics | Biology 101 | A |
| 4 | 3 | Alice | Johnson | Physics | Calculus I | B- |
| 5 | 3 | Alice | Johnson | Physics | Organic Chemistry | B |
| 6 | 4 | Bob | Lee | Chemistry | Biology 101 | A |
| 7 | 5 | Charlie | Brown | Biology | Introduction to Programming | B+ |
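
To see where `LEFT JOIN` and `INNER JOIN` actually differ, the left table needs a row without a match. The sketch below uses trimmed-down in-memory tables and a hypothetical student (`Dana`, not part of the notebook's data) who has no enrollments; `demo_conn` is a name used only in this example.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT);")
demo_conn.execute(
    "CREATE TABLE Enrollments (EnrollmentID INTEGER PRIMARY KEY, "
    "StudentID INTEGER, CourseID INTEGER, Grade TEXT);"
)
demo_conn.executemany("INSERT INTO Students VALUES (?, ?);", [
    (1, 'John'),
    (6, 'Dana'),  # hypothetical student with no enrollments
])
demo_conn.execute("INSERT INTO Enrollments VALUES (1, 1, 101, 'A');")

# LEFT JOIN keeps Dana and fills the Enrollments columns with NULL (None in Python)
left_rows = demo_conn.execute("""
SELECT Students.FirstName, Enrollments.Grade
FROM Students
LEFT JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
ORDER BY Students.StudentID;
""").fetchall()

# INNER JOIN drops Dana because there is no matching enrollment
inner_rows = demo_conn.execute("""
SELECT Students.FirstName, Enrollments.Grade
FROM Students
INNER JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
ORDER BY Students.StudentID;
""").fetchall()

print(left_rows)
print(inner_rows)
demo_conn.close()
```

Adding `WHERE Enrollments.StudentID IS NULL` to the `LEFT JOIN` query is a common way to list only the unmatched rows.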


## 4. Data Modeling

Data modeling involves designing the structure of a database. It includes defining tables, columns, data types, and the relationships between tables. We'll briefly discuss normalization and entity-relationship diagrams (ERDs).

### Normalization

Normalization is the process of organizing data to reduce redundancy and improve data integrity. There are several normal forms; we'll discuss the first three.

**First Normal Form (1NF):**
- Ensure that the table has a primary key.
- Eliminate repeating groups of columns (for example, `Course1`, `Course2`, `Course3`).
- Ensure that each field contains only atomic (indivisible) values.

**Second Normal Form (2NF):**
- Achieve 1NF.
- Ensure that every non-key column depends on the whole primary key, not on just part of a composite key.
- Move columns that depend on only part of the key into separate tables, and link those tables using foreign keys.

**Third Normal Form (3NF):**
- Achieve 2NF.
- Remove columns that depend on another non-key column rather than directly on the primary key (transitive dependencies).
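
As a rough illustration of the second normal form, the sketch below starts from a hypothetical flat table, `FlatEnrollments`, that repeats the course name on every enrollment row, and splits it so that the course name is stored once per course. The table and connection names are assumptions made for this example and are not part of `school.db`.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")

# Hypothetical flat table: CourseName is repeated for every enrolled student,
# although it depends only on CourseID (part of the composite key)
demo_conn.execute("""
CREATE TABLE FlatEnrollments (
    StudentID INTEGER,
    CourseID INTEGER,
    CourseName TEXT,
    Grade TEXT,
    PRIMARY KEY (StudentID, CourseID)
);
""")
demo_conn.executemany("INSERT INTO FlatEnrollments VALUES (?, ?, ?, ?);", [
    (1, 101, 'Introduction to Programming', 'A'),
    (5, 101, 'Introduction to Programming', 'B+'),
    (1, 102, 'Calculus I', 'B+'),
])

# Move the course name into its own table, one row per course
demo_conn.execute("CREATE TABLE Courses (CourseID INTEGER PRIMARY KEY, CourseName TEXT NOT NULL);")
demo_conn.execute("INSERT INTO Courses SELECT DISTINCT CourseID, CourseName FROM FlatEnrollments;")

# Keep only the columns that depend on the full (StudentID, CourseID) key
demo_conn.execute("""
CREATE TABLE Enrollments (
    StudentID INTEGER,
    CourseID INTEGER REFERENCES Courses(CourseID),
    Grade TEXT,
    PRIMARY KEY (StudentID, CourseID)
);
""")
demo_conn.execute("INSERT INTO Enrollments SELECT StudentID, CourseID, Grade FROM FlatEnrollments;")

course_count = demo_conn.execute("SELECT COUNT(*) FROM Courses;").fetchone()[0]
enrollment_count = demo_conn.execute("SELECT COUNT(*) FROM Enrollments;").fetchone()[0]
print(course_count, enrollment_count)
demo_conn.close()
```

After the split, renaming a course means updating a single row in `Courses` instead of every enrollment that mentions it.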

### Entity-Relationship Diagram (ERD)

An ERD visually represents the tables (entities) in a database and their relationships. In text form, the entities and keys of our sample database are:

- **Students** (`StudentID` PK)
- **Courses** (`CourseID` PK)
- **Enrollments** (`EnrollmentID` PK, `StudentID` FK, `CourseID` FK)

This shows that each student can enroll in multiple courses, and each course can have multiple students, establishing a many-to-many relationship facilitated by the `Enrollments` table.
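
The many-to-many relationship can be read in both directions through the `Enrollments` table. The sketch below rebuilds the relevant columns of the sample data in an in-memory database (`demo_conn` is a name used only in this example) and counts students per course and courses per student.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Courses (CourseID INTEGER PRIMARY KEY, CourseName TEXT NOT NULL);")
demo_conn.execute(
    "CREATE TABLE Enrollments (EnrollmentID INTEGER PRIMARY KEY, StudentID INTEGER, CourseID INTEGER);"
)
demo_conn.executemany("INSERT INTO Courses VALUES (?, ?);", [
    (101, 'Introduction to Programming'),
    (102, 'Calculus I'),
    (103, 'General Physics'),
    (104, 'Organic Chemistry'),
    (105, 'Biology 101'),
])
demo_conn.executemany("INSERT INTO Enrollments VALUES (?, ?, ?);", [
    (1, 1, 101), (2, 1, 102), (3, 2, 103), (4, 3, 104),
    (5, 4, 105), (6, 5, 101), (7, 2, 105), (8, 3, 102),
])

# One course, many students
students_per_course = demo_conn.execute("""
SELECT Courses.CourseName, COUNT(Enrollments.StudentID) AS NumStudents
FROM Courses
LEFT JOIN Enrollments ON Courses.CourseID = Enrollments.CourseID
GROUP BY Courses.CourseID, Courses.CourseName
ORDER BY Courses.CourseID;
""").fetchall()

# One student, many courses
courses_per_student = demo_conn.execute("""
SELECT StudentID, COUNT(*) AS NumCourses
FROM Enrollments
GROUP BY StudentID
ORDER BY StudentID;
""").fetchall()

print(students_per_course)
print(courses_per_student)
demo_conn.close()
```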

## Exercises

Try the following exercises to test your understanding of SQL fundamentals.

### Exercise 1: Retrieve Specific Data

Write a SQL query to retrieve the first and last names of all students majoring in 'Computer Science'.

In [18]:
query = "SELECT FirstName, LastName FROM Students WHERE Major = 'Computer Science';"
df = pd.read_sql_query(query, conn)
display(df)

|   | FirstName | LastName |
|---|---|---|
| 0 | John | Doe |


### Exercise 2: Aggregate Functions

Calculate the average number of credits earned by students majoring in 'Biology'.

In [19]:
query = "SELECT AVG(Credits) AS AvgCredits FROM Students WHERE Major = 'Biology';"
df = pd.read_sql_query(query, conn)
display(df)

|   | AvgCredits |
|---|---|
| 0 | 14.0 |


### Exercise 3: Join Tables

Write a SQL query to list all courses along with the names of students enrolled in each course.

In [20]:
query = """
SELECT Students.FirstName, Students.LastName, Courses.CourseName
FROM Students
JOIN Enrollments ON Students.StudentID = Enrollments.StudentID
JOIN Courses ON Enrollments.CourseID = Courses.CourseID;
"""
df = pd.read_sql_query(query, conn)
display(df)

|   | FirstName | LastName | CourseName |
|---|---|---|---|
| 0 | John | Doe | Introduction to Programming |
| 1 | John | Doe | Calculus I |
| 2 | Jane | Smith | General Physics |
| 3 | Alice | Johnson | Organic Chemistry |
| 4 | Bob | Lee | Biology 101 |
| 5 | Charlie | Brown | Introduction to Programming |
| 6 | Jane | Smith | Biology 101 |
| 7 | Alice | Johnson | Calculus I |


### Exercise 4: Data Modeling

Design a new table called `Professors` with appropriate columns and establish a relationship between `Professors` and `Courses`.

In [21]:
# Create Professors table
cursor.execute('''
CREATE TABLE IF NOT EXISTS Professors (
    ProfessorID INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName TEXT NOT NULL,
    Department TEXT NOT NULL
);
''')

conn.commit()
print("Professors table created successfully!")

Professors table created successfully!


In [22]:
# Add ProfessorID to Courses table
try:
    cursor.execute('ALTER TABLE Courses ADD COLUMN ProfessorID INTEGER REFERENCES Professors(ProfessorID);')
    print("ProfessorID column added to Courses table!")
except sqlite3.OperationalError:
    # Most likely the column already exists from a previous run
    print("ProfessorID column not added (it probably already exists).")

conn.commit()

ProfessorID column added to Courses table!


In [23]:
# Insert sample professors
professors = [
    (1, 'Dr. Alan', 'Turing', 'Computer Science'),
    (2, 'Dr. Isaac', 'Newton', 'Mathematics'),
    (3, 'Dr. Marie', 'Curie', 'Physics'),
    (4, 'Dr. Rosalind', 'Franklin', 'Biology')
]

cursor.executemany('INSERT OR IGNORE INTO Professors VALUES (?, ?, ?, ?);', professors)
conn.commit()
print("Sample professors inserted successfully!")

Sample professors inserted successfully!


In [24]:
# Assign professors to courses
course_professors = [
    (1, 101),  # Dr. Turing teaches Intro to Programming
    (2, 102),  # Dr. Newton teaches Calculus I
    (3, 103),  # Dr. Curie teaches General Physics
    (4, 104),  # Dr. Franklin teaches Organic Chemistry
    (4, 105)   # Dr. Franklin also teaches Biology 101
]

cursor.executemany('UPDATE Courses SET ProfessorID = ? WHERE CourseID = ?;', course_professors)
conn.commit()
print("Professors assigned to courses successfully!")

Professors assigned to courses successfully!


In [25]:
query = """
SELECT Courses.CourseName, Professors.FirstName, Professors.LastName
FROM Courses
JOIN Professors ON Courses.ProfessorID = Professors.ProfessorID;
"""
df = pd.read_sql_query(query, conn)
display(df)

|   | CourseName | FirstName | LastName |
|---|---|---|---|
| 0 | Introduction to Programming | Dr. Alan | Turing |
| 1 | Calculus I | Dr. Isaac | Newton |
| 2 | General Physics | Dr. Marie | Curie |
| 3 | Organic Chemistry | Dr. Rosalind | Franklin |
| 4 | Biology 101 | Dr. Rosalind | Franklin |


## Conclusion

This notebook provided an overview of SQL fundamentals, including data selection, aggregation, joins, and basic data modeling. Practice these concepts by experimenting with the sample data and completing the exercises.

## Cleanup

It's good practice to close the database connection when you're done.

In [26]:
# Close the connection
conn.close()
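
An alternative to calling `close()` by hand is to wrap the connection in `contextlib.closing`, which closes it even if an error occurs inside the block. This is a minimal sketch on a throwaway in-memory database; the `Demo` table and the `demo_conn` name are assumptions made for this example.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_conn.close() when the with-block exits
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE Demo (Value INTEGER);")
    demo_conn.execute("INSERT INTO Demo VALUES (1);")
    demo_conn.commit()
    value = demo_conn.execute("SELECT Value FROM Demo;").fetchone()[0]

# After the block, any further use of the connection raises ProgrammingError
try:
    demo_conn.execute("SELECT 1;")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(value, still_open)
```

Note that using a `sqlite3` connection directly as a context manager (`with conn:`) only commits or rolls back the current transaction; it does not close the connection.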