# SQL and Python practice

## First off: Congrats on the job interview! 🎉🎊

![](https://i.gifer.com/7WZ0.gif)

In this tutorial, we will practice basic SQL questions before diving into data manipulation using pandas in Python. It's alright if you can't answer every question (including questions in the upcoming interview). What matters is the ability to convey your thoughts and ideas to the interviewer so they can follow your thought process. Companies hire people who can communicate their ideas on complex topics, so don't worry about not understanding some questions!

### Step 1: Run the code below to import the required packages and functions to run SQL queries.

In [None]:
# Import the required packages
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Connect to the database.sqlite file
conn = sqlite3.connect("../input/sf-salaries/database.sqlite")

# Function to display the query result as a DataFrame
def query_result(query):
    cursor = conn.cursor()  # create a cursor object to run the query
    cursor.execute(query)  # execute the query passed as an argument to this function
    # fetchall() returns only the rows (a list of tuples), so we take the
    # column names from cursor.description; passing them through columns=
    # also works when the query returns no rows
    columns = [col_name[0] for col_name in cursor.description]
    df = pd.DataFrame(cursor.fetchall(), columns=columns)
    cursor.close()
    return df
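
If you want to try the cursor-to-DataFrame idea without the Kaggle file, here is a minimal sketch on an in-memory database. The `staff` table and its columns are made up for illustration, and `pd.read_sql_query` is used as an alternative to the helper above; it is not the helper itself.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for database.sqlite
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE staff (name TEXT, job_title TEXT, base_pay REAL)")
demo_conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [("Ann", "Engineer", 100.0), ("Bob", "Clerk", 50.0)],
)

# pandas runs the query and picks up the column names on its own
demo_df = pd.read_sql_query(
    "SELECT name, base_pay FROM staff ORDER BY name;", demo_conn
)
print(demo_df)
```

Both approaches end with a DataFrame whose columns match the SELECT list, so use whichever you can explain more comfortably in an interview.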

### Step 2: Now let's run some SQL queries!
As you can see below, we select all the columns from the Salaries table. Usually, you import the required data from the database into your IDE (RStudio, PyCharm, Jupyter Notebook, etc.) and use Python or R to manipulate it. Some may ask whether we can just import **ALL** the data from the SQL database and do the required cleaning in Python or R. You can do that if **your data size is small**, but even then it is bad practice. Ideally, you import only the data you need and, **if possible, do the required filtering and cleaning in SQL** before importing the data into your IDE.

In [None]:
all_data = query_result("SELECT * FROM Salaries;")
all_data
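
To make the "filter in SQL first" advice concrete, here is a small sketch that compares the two approaches. The in-memory `staff` table, its columns, and its values are assumptions made up for this example.

```python
import sqlite3
import pandas as pd

# Hypothetical toy table, not the real Salaries data
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE staff (name TEXT, job_title TEXT, year INTEGER, base_pay REAL)"
)
conn_demo.executemany(
    "INSERT INTO staff VALUES (?, ?, ?, ?)",
    [
        ("Ann", "Engineer", 2013, 100.0),
        ("Bob", "Clerk", 2014, 50.0),
        ("Cy", "Nurse", 2014, 90.0),
    ],
)

# Option 1: pull every row and column, then filter in pandas
everything = pd.read_sql_query("SELECT * FROM staff ORDER BY name;", conn_demo)
filtered_in_pandas = everything.loc[
    everything["year"] == 2014, ["name", "base_pay"]
].reset_index(drop=True)

# Option 2: let the database drop unwanted rows and columns before transfer
filtered_in_sql = pd.read_sql_query(
    "SELECT name, base_pay FROM staff WHERE year = ? ORDER BY name;",
    conn_demo,
    params=(2014,),
)

print(len(everything), len(filtered_in_sql))
```

Both options give the same rows here, but option 2 moves less data out of the database, and that difference grows with the size of the table.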

### Step 3: Select the columns you think are important to your analysis.
I will answer this one for you. You will need to work out the remaining questions on your own.
Here, I choose all the columns except id.

In [None]:
df = query_result("SELECT EmployeeName, JobTitle, BasePay, OvertimePay, OtherPay, Benefits, TotalPay, TotalPayBenefits, Year, Notes, Agency, Status FROM Salaries;")
df

### Step 4: Get rid of the rows with "Not provided" and 0.00 data using SQL
Here, use a WHERE clause.

In [None]:
# Remove Not provided and 0.00 from the data
df = query_result("")  # write your query inside the quotes
df

### Step 5: Identify how many unique job titles there are
Hint: Use DISTINCT

In [None]:
# Identify how many unique job titles
df = query_result("")  # write your query inside the quotes
df
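
If the hint is unfamiliar, the sketch below shows what DISTINCT does on a made-up table. The `staff` table and its values are hypothetical, so you still need to adapt the idea to the Salaries table yourself.

```python
import sqlite3

# Hypothetical toy table with one repeated job title
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE staff (name TEXT, job_title TEXT)")
conn_demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ann", "Engineer"), ("Bob", "Clerk"), ("Cy", "Engineer"), ("Di", "Nurse")],
)

# DISTINCT collapses repeated values into one row each
distinct_titles = [
    row[0]
    for row in conn_demo.execute(
        "SELECT DISTINCT job_title FROM staff ORDER BY job_title;"
    )
]

# COUNT(DISTINCT ...) returns how many different values there are
n_titles = conn_demo.execute(
    "SELECT COUNT(DISTINCT job_title) FROM staff;"
).fetchone()[0]

print(distinct_titles)  # ['Clerk', 'Engineer', 'Nurse']
print(n_titles)  # 3
```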

### Step 6: Identify how many jobs appear each year from 2011 to 2014. Generate a line plot using Matplotlib to show the trend. Explain what kind of trend we are observing.
Hint: Use GROUP BY and COUNT

In [None]:
df = query_result("")  # write your query inside the quotes
df
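
As a rough illustration of the GROUP BY and COUNT hint combined with a line plot, here is a sketch on made-up data. The `staff` table, its columns, and the row counts are assumptions, so the trend it shows says nothing about the real dataset.

```python
import sqlite3
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy table: one row per person per year
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE staff (name TEXT, year INTEGER)")
conn_demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Ann", 2011), ("Bob", 2011), ("Cy", 2012),
     ("Di", 2013), ("Ed", 2013), ("Flo", 2013)],
)

# GROUP BY builds one group per year; COUNT(*) counts the rows in each group
per_year = pd.read_sql_query(
    "SELECT year, COUNT(*) AS n_rows FROM staff GROUP BY year ORDER BY year;",
    conn_demo,
)

# Line plot of the yearly counts (displayed inline in a notebook)
fig, ax = plt.subplots()
ax.plot(per_year["year"], per_year["n_rows"], marker="o")
ax.set_xticks(per_year["year"])
ax.set_xlabel("year")
ax.set_ylabel("number of rows")
ax.set_title("Rows per year (toy data)")
```

Setting the x ticks to the year values keeps Matplotlib from drawing fractional years such as 2011.5 on the axis.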

### Step 7: Identify how many times the word "MACHINE" appears in the JobTitle column.
Hint: Use LIKE

In [None]:
df = query_result("")  # write your query inside the quotes
df
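
Here is a minimal sketch of pattern matching with LIKE, using a different word and a made-up `staff` table. Two things to keep in mind: in SQLite, LIKE is case-insensitive for ASCII letters by default, and the query counts rows that contain the word, not every occurrence of the word.

```python
import sqlite3

# Hypothetical toy table of job titles
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE staff (job_title TEXT)")
conn_demo.executemany(
    "INSERT INTO staff VALUES (?)",
    [("SENIOR ENGINEER",), ("Engineer II",), ("CLERK",), ("ENGINEERING AIDE",)],
)

# % matches any run of characters, so '%ENGINEER%' means "contains ENGINEER"
n_matches = conn_demo.execute(
    "SELECT COUNT(*) FROM staff WHERE job_title LIKE '%ENGINEER%';"
).fetchone()[0]

print(n_matches)  # 3
```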

### Step 8: Calculate the average total pay for each year and evaluate the trend using a line plot.

In [None]:
df = query_result("")  # write your query inside the quotes
df

### Step 9: Identify the top 10 jobs with the highest base pay. Plot them using a barplot.

In [None]:
df = query_result("")  # write your query inside the quotes
df
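
One way to get a "top N" result in SQL is ORDER BY with LIMIT. The sketch below ranks job titles by average base pay on a made-up `staff` table. Whether "highest base pay" should mean the average, the maximum, or individual rows is a choice you should state to your interviewer; the average is only an assumption here.

```python
import sqlite3

# Hypothetical toy table
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE staff (job_title TEXT, base_pay REAL)")
conn_demo.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("Engineer", 120.0), ("Engineer", 100.0), ("Clerk", 50.0),
     ("Nurse", 90.0), ("Chief", 200.0)],
)

# Average pay per title, highest first, keeping only the first two rows
top_two = conn_demo.execute(
    "SELECT job_title, AVG(base_pay) AS avg_pay "
    "FROM staff GROUP BY job_title "
    "ORDER BY avg_pay DESC LIMIT 2;"
).fetchall()

print(top_two)  # [('Chief', 200.0), ('Engineer', 110.0)]
```

Switching DESC to ASC returns the lowest-paid titles instead.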

### Step 10: Identify the 10 jobs with the lowest base pay. Plot them using a barplot.

In [None]:
df = query_result("")  # write your query inside the quotes
df

### Step 11: Identify the rows with the terms "junior", "senior" and "chief" and compare their average salary using a boxplot.

In [None]:
df = query_result("")  # write your query inside the quotes
df

## Pandas data manipulation
Here, for convenience, we will import all the columns except ID from the SQL database and assign them to a variable called all_data.

In [None]:
# Run this and don't change anything.
all_data = query_result("SELECT EmployeeName, JobTitle, BasePay, OvertimePay, OtherPay, Benefits, TotalPay, TotalPayBenefits, Year, Notes, Agency, Status FROM Salaries;")

### Step 1: Let's get started with some data manipulation using pandas. Use the all_data variable assigned above.
Remove rows with "Not provided" and 0.00 using a pandas solution.

In [None]:
# Enter code here
clean_data = None  # replace None with your pandas solution
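
If boolean masks in pandas are new to you, here is a small sketch on a made-up DataFrame. The column names and values are hypothetical, and converting to numbers with `pd.to_numeric` is only one possible approach.

```python
import pandas as pd

# Hypothetical toy data mixing numbers and a placeholder string
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "base_pay": [100.0, "Not provided", 0.0, 75.5],
})

# Turn anything non-numeric (such as the placeholder string) into NaN
pay = pd.to_numeric(toy["base_pay"], errors="coerce")

# Keep rows whose pay is a real number and not zero
toy_clean = toy[pay.notna() & (pay != 0)].copy()

# Store the numeric version of the column in the cleaned frame
toy_clean["base_pay"] = pay[toy_clean.index]

print(toy_clean)
```

Each condition produces a Series of True/False values, and `&` combines them row by row, so the parentheses around `pay != 0` are required.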

### Step 2: Identify the EmployeeName values that hold multiple positions over the 4 years. Use the clean_data variable assigned in Step 1, where you removed the "Not provided" and 0.00 rows.

In [None]:
# Enter code here
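
One possible way to think about "multiple positions" is to count the distinct job titles per name. The sketch below does that on made-up data with hypothetical column names. It also assumes that a name identifies exactly one person, which may not hold in real data.

```python
import pandas as pd

# Hypothetical toy data: Ann changes title, Bob keeps the same one
toy = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Bob", "Cy"],
    "job_title": ["Clerk", "Manager", "Nurse", "Nurse", "Engineer"],
    "year": [2011, 2012, 2011, 2012, 2011],
})

# Number of different job titles recorded for each name
titles_per_person = toy.groupby("name")["job_title"].nunique()

multiple = titles_per_person[titles_per_person > 1].index.tolist()
single = titles_per_person[titles_per_person == 1].index.tolist()

print(multiple)  # ['Ann']
print(single)  # ['Bob', 'Cy']
```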

### Step 3: Identify the EmployeeName values that hold only a single position over the 4 years

In [None]:
# Enter code here.

### Step 4: Identify all unique job titles

In [None]:
# Enter code here

### Step 5: Identify the 10 most common jobs

In [None]:
# Enter code here
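
For frequency questions like this one, `value_counts` is a common starting point. Here is a minimal sketch on a made-up Series; the titles and the choice of the top 2 instead of the top 10 are only for illustration.

```python
import pandas as pd

# Hypothetical toy column of job titles
titles = pd.Series(
    ["Nurse", "Clerk", "Nurse", "Engineer", "Nurse", "Clerk"],
    name="job_title",
)

# value_counts sorts from most to least frequent; head keeps the first rows
top_two = titles.value_counts().head(2)

print(top_two)
```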

### Step 6: Identify the top 10 best paying jobs (based on base pay only)

In [None]:
# Enter code here.