# Food Health Analysis

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

# How to complete and submit
Each exercise will look something like this:

```python
example_query = ''
#example_result = pd.read_sql(example_query, conn)
```

In each exercise you will need to define a query variable by writing the SQL code that you think will solve the problem. Once you have your query, uncomment the second line; this executes the query and loads the resulting data into a DataFrame.

Nothing else needs to be changed in the 2nd line besides uncommenting it. 

After running the cell you can inspect the result to check that it is what you expect. KATE looks for variables with the names defined in this notebook, so it is important not to rename them.

Once you've completed the exercises, upload this notebook to **KATE** to get feedback. You can also upload the notebook when only parts of it are complete; if you do so, make sure you do not uncomment the `pd.read_sql` lines for which you don't have a query yet.
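The query-then-load pattern described above can be tried out safely on a throwaway database before touching the real one. The sketch below is only an illustration: the in-memory database, the `fruits` table and the `demo_` names are made up for this example and are not part of the dataset or of what KATE checks.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for the notebook's `conn`
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE fruits (name TEXT, price REAL)')
demo_conn.executemany(
    'INSERT INTO fruits VALUES (?, ?)',
    [('apple', 0.5), ('pear', 0.75), ('fig', 1.25)],
)

# Step 1: write the SQL as a string
demo_query = 'SELECT COUNT(*) AS fruit_count FROM fruits'

# Step 2: execute it and load the rows into a DataFrame
demo_result = pd.read_sql(demo_query, demo_conn)
```

Displaying `demo_result` in a cell shows a one-row DataFrame whose single column takes its name from the `AS` alias in the query.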

Refer to the instructions on **KATE** for more details on the dataset.

# Setup

The code below sets up a connection to the SQLite database.

**Do not change this code!** The `conn` variable will be used throughout the notebook to query the database.

In [None]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('data/sfscores.sqlite')

# Background

This dataset has been taken from an open data platform (found __[here](https://data.sfgov.org/)__) hosted by the city of San Francisco. It is stored as an SQLite file, which is a database created using SQLite. SQLite is a lightweight, file-based relational database engine whose SQL syntax is similar to that of the vast majority of SQL flavours. It provides a fast way of storing and accessing databases locally.

The data includes restaurant scores produced by the San Francisco health department up to and including the year 2016. These scores are generated by a health inspector, who bases them on the violations observed. The violations fall into three categories: High risk, Moderate risk or Low risk.

The database itself consists of three tables: 

* _**businesses**_: information relating to restaurant businesses such as the owner and location
* _**inspections**_: information about individual inspection events
* _**violations**_: information about violation events



Referencing these tables and their respective columns will be useful in answering the following questions. The code below shows the column names (`name`) and data types (`type`) for the `businesses` table. To see the same information for the other tables, substitute `violations` or `inspections` for `businesses` in the query string.


In [None]:
# Code to show column names and data types within each table
table_info_query = 'PRAGMA table_info(businesses)'
table_info = pd.read_sql(table_info_query, conn)
table_info[['name','type']]
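Rather than editing the query by hand for each table, the same `PRAGMA table_info` call can be run in a loop over every table in a database. The sketch below demonstrates this on a made-up in-memory database (the `shops` and `visits` tables and the `demo_` names are assumptions for illustration only); the table names are read from SQLite's built-in `sqlite_master` catalogue.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database with two toy tables
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE shops (shop_id INTEGER, name TEXT)')
demo_conn.execute('CREATE TABLE visits (shop_id INTEGER, visit_date TEXT, score REAL)')

# sqlite_master lists every object in the database; keep only the tables
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name",
    demo_conn,
)

# Collect the column names and types of each table, keyed by table name
demo_table_info = {}
for table_name in tables['name']:
    info = pd.read_sql(f'PRAGMA table_info({table_name})', demo_conn)
    demo_table_info[table_name] = info[['name', 'type']]
```

Looking up `demo_table_info['visits']` then returns the column listing for that table, in the same shape as the `table_info` output above.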


# Queries

## Part 1: Essentials

**1. Write a SQL query that finds the number of business ids in the businesses table**

In [None]:
# Add your code below
# number_of_businesses_query = ...
# number_of_businesses_result = pd.read_sql(number_of_businesses_query, conn)


**2. Write a SQL query that finds out how many unique business names are registered with the San Francisco food health department (i.e. all unique business names in the businesses table) and name the column 'unique restaurant name count'.**
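Questions 2 and 3 ask for column names that contain spaces. In SQLite such an alias has to be quoted, for example with double quotes, otherwise the query will not parse. A minimal sketch on a made-up table (the `pets` table and the `demo_` names are hypothetical and unrelated to the dataset):

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table used only to demonstrate alias quoting
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE pets (name TEXT, weight INTEGER)')
demo_conn.executemany(
    'INSERT INTO pets VALUES (?, ?)',
    [('Rex', 30), ('Tom', 4), ('Fido', 12)],
)

# Double quotes around the alias allow it to contain spaces
demo_query = 'SELECT SUM(weight) AS "total weight" FROM pets'
demo_result = pd.read_sql(demo_query, demo_conn)
```

The resulting DataFrame has a single column named `total weight`, exactly as written inside the double quotes.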

In [None]:
# Add your code below
# unique_business_names_query = ...
# unique_business_names_result = pd.read_sql(unique_business_names_query, conn)


**3. Write a SQL query that finds the earliest and latest dates on which a health investigation is recorded in this database. You will find this information in the inspections table. Name these columns 'earliest date' and 'latest date'.**

In [None]:
# Add your code below
# earliest_and_latest_investigation_query = ...
# earliest_and_latest_investigation_result = pd.read_sql(earliest_and_latest_investigation_query, conn)


**4. How many businesses are there in San Francisco whose owners live in the same area (postal code/zip code) as the one in which the business is located?**

In [None]:
# Add your code below
# businesses_with_owners_nearby_query = ...
# businesses_with_owners_nearby_result = pd.read_sql(businesses_with_owners_nearby_query, conn)


**5. Out of those businesses, how many of them have a registered business certificate?**

In [None]:
# Add your code below
# businesses_with_registered_certificate_query = ...
# businesses_with_registered_certificate_result = pd.read_sql(businesses_with_registered_certificate_query, conn)


## Part 2: Groupby

**6. Find out the distribution of the risk exposure of all the violations reported in the database (i.e. how many low, moderate and high risk violations are recorded). The first column of the result should be 'risk_category' and the second column should be the count, called 'frequency'.**

In [None]:
# Add your code below
# distribution_of_risk_exposure_query = ...
# distribution_of_risk_exposure_result = pd.read_sql(distribution_of_risk_exposure_query, conn)


**7. Find out the distribution of the risk exposure of all the violations reported in the database that are *water related*. Sort them by frequency (count) from high to low.**

In [None]:
# Add your code below
# distribution_of_water_risk_exposure_query = ...
# distribution_of_water_risk_exposure_result = pd.read_sql(distribution_of_water_risk_exposure_query, conn)


**8. What types of inspections do the authorities conduct and how often do they occur in general? Calculate the distribution of different types of inspections with their frequency (type, frequency) based on inspection records. Sort them in ascending order based on frequency.**

In [None]:
# Add your code below
# inspection_type_and_frequency_query = ...
# inspection_type_and_frequency_result = pd.read_sql(inspection_type_and_frequency_query, conn)


**9. What is the average score given to restaurants based on the type of inspection? Based on the results, identify the types of inspections that are not scored (NULL) and remove those categories from the result set. The 'average_score' should be rounded to one decimal place. Sort the results in ascending order based on the average score. Hint: use the function `ROUND(score, 1)`**
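When rounding an average, the aggregate goes inside `ROUND`. In SQLite, `AVG` skips NULL values and returns NULL for a group that contains nothing else, so such groups can be filtered out after grouping. The sketch below shows one possible way of combining these on a made-up table; the `readings` table, its columns and the `demo_` names are hypothetical, and a `HAVING` clause is just one of several ways to drop the NULL groups.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory table: sensor B has no recorded values at all
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE readings (sensor TEXT, value REAL)')
demo_conn.executemany(
    'INSERT INTO readings VALUES (?, ?)',
    [('A', 1.0), ('A', 2.0), ('A', 2.5),
     ('B', None), ('B', None),
     ('C', 3.25), ('C', 3.0)],
)

# Round the per-group average, drop groups whose average is NULL,
# and sort the remaining groups from lowest to highest average
demo_query = '''
SELECT sensor, ROUND(AVG(value), 1) AS average_value
FROM readings
GROUP BY sensor
HAVING AVG(value) IS NOT NULL
ORDER BY average_value ASC
'''
demo_result = pd.read_sql(demo_query, demo_conn)
```

Only sensors A and C appear in `demo_result`, because every value recorded for sensor B is NULL.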

In [None]:
# Add your code below
# average_score_per_inspection_type_query = ...
# average_score_per_inspection_type_result = pd.read_sql(average_score_per_inspection_type_query, conn)


**10. Find the restaurant owners (owner_name) that own one or multiple restaurants in the city with the number of restaurants (num_restaurants) they own. Find the 10 owners who own the most restaurants and sort them by the number of restaurants they own in decreasing order.**

In [None]:
# Add your code below
# owners_with_restaurant_numbers_query = ...
# owners_with_restaurant_numbers_result = pd.read_sql(owners_with_restaurant_numbers_query, conn)


## Part 3: Subqueries and joins

**11. From the businesses table, find all owners that own five or more restaurants. Then find the 10 most popular locations for restaurants (using postal_code) amongst the owners who have five or more restaurants. The final result should return the 10 most popular areas (postal_code) and the frequency with which they appear in our group of owners who own five or more restaurants. The result should have two columns (postal_code and frequency) and should be presented in descending order of frequency.**

In [None]:
# Add your code below
# most_popular_post_codes_query = ...
# most_popular_post_codes_result = pd.read_sql(most_popular_post_codes_query, conn)


**12. Now it might be interesting to look at some statistics. For all the restaurants in the "94103" post code, let's calculate the minimum score (as "min_score"), average score (as "avg_score") and maximum score (as "max_score"). The average score should be rounded to one decimal place and you should only consider restaurants that have undergone inspection (so the score is NOT NULL).**

In [None]:
# Add your code below
# min_avg_max_score_query = ...
# min_avg_max_score_result = pd.read_sql(min_avg_max_score_query, conn)


**13. Now we can get a bit more serious and look at how many times restaurants in the "94103" post code have committed health violations. We can then group them based on their risk category so we know which restaurants to avoid. The output should have two columns (risk_category, frequency) and be sorted in descending order by frequency.**

In [None]:
# Add your code below
# market_street_health_violations_query = ...
# market_street_health_violations_result = pd.read_sql(market_street_health_violations_query, conn)
