# Load Data

1. In the following cell, insert a Spark Session DataFrame from the file panel for your user.json.bz2 file which contains the user data from the Yelp Dataset Challenge.  

2. In the cell below that, add your bucket name as the second parameter in the call to `cos.url` to define a URL to the user data file.

In [107]:
# The code was removed by DSX for sharing.

In [108]:
path_user = cos.url('user.json.bz2', 'spring2018andy023363be332e40639c4287c87e0af5e0')

In [109]:
df_user = spark.read.json(path_user)
print "user count:", df_user.count()
df_user.printSchema()

user count: 1326101
root
 |-- average_stars: double (nullable = true)
 |-- compliment_cool: long (nullable = true)
 |-- compliment_cute: long (nullable = true)
 |-- compliment_funny: long (nullable = true)
 |-- compliment_hot: long (nullable = true)
 |-- compliment_list: long (nullable = true)
 |-- compliment_more: long (nullable = true)
 |-- compliment_note: long (nullable = true)
 |-- compliment_photos: long (nullable = true)
 |-- compliment_plain: long (nullable = true)
 |-- compliment_profile: long (nullable = true)
 |-- compliment_writer: long (nullable = true)
 |-- cool: long (nullable = true)
 |-- elite: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- fans: long (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- funny: long (nullable = true)
 |-- name: string (nullable = true)
 |-- review_count: long (nullable = true)
 |-- useful: long (nullable = true)
 |-- user_id: string (nullable = true)
 |-

### Create a Temporary View

The following cell creates a temporary view from the `df_user` DataFrame.  There are two ways to create a temporary view; here we are 
using the `createOrReplaceTempView` method of the DataFrame.  If you don't remember this long name, as long as your DataFrame has already been created, 
just type the DataFrame's name followed by a period and then start typing the first few characters of the command.  At that point, press the tab key and 
Jupyter will prompt you with a list of method names to select from.

In the command for creating the temporary view, the string passed as the parameter is used to name the temporary view.  When querying using Spark SQL, you
will use this name as though it were a table.

In [110]:
 df_user.createOrReplaceTempView("user")

### Load the SSA Gender Data

This is the ssa_name_gender.tsv file that is based on the Social Security Administration data
and the wrangling we did previously.  A DataFrame from that was written out to 
Object Storage and downloaded so you could upload it here.

1. In the cell below, insert a Spark Session Setup for the file.  Since this is the second file you are loading from that bucket, the boilerplate is not added.
2. Change the name of the variable being declared (for example, `path_1` or `path_2`) to `path_gender`.
3. When you run the cell after that, it creates a DataFrame based on the URL from your cloud object storage and then creates a temporary view named `gender`.

In [111]:

# Please read the documentation of PySpark to learn more about the possibilities to load data files.
# PySpark documentation: https://spark.apache.org/docs/2.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession
# The SparkSession object is already initialized for you.
# The following variable contains the path to your file on your IBM Cloud Object Storage.
path_2 = cos.url('ssa_name_gender.tsv.bz2', 'spring2018andy023363be332e40639c4287c87e0af5e0')


In [112]:
path_gender = cos.url('ssa_name_gender.tsv.bz2', 'spring2018andy023363be332e40639c4287c87e0af5e0')

In [113]:
df_gender = spark.read.option("header","true").option("sep","\t").csv(path_gender)
print "name count:", df_gender.count()

df_gender.printSchema()

df_gender.createOrReplaceTempView("gender")

name count: 96174
root
 |-- name: string (nullable = true)
 |-- F: string (nullable = true)
 |-- M: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- gender_ratio: string (nullable = true)



# Basic Queries Against Temporary Views

In the following cell is the skeleton of a SELECT query against the User table.  The required fields in each query are:

* SELECT (this tells Spark which "columns" we want from the data)
* FROM (this tells Spark what temporary view to get the data from)


Take a look at the schema for the df_user DataFrame.  The fields we want to include are the user_id, name, and columns for the useful, funny, and cool votes that user has cast.
What is the name of your temporary view for users?  That goes in the FROM clause.

In [114]:
print "SQL SELECT QUERY: "

spark.sql("""
SELECT user_id as Users, name, useful, funny, cool
FROM user
""").show(truncate=False)

SQL SELECT QUERY: 
+----------------------+-------+------+-----+----+
|Users                 |name   |useful|funny|cool|
+----------------------+-------+------+-----+----+
|oMy_rEb0UBEmMlu-zcxnoQ|Johnny |0     |0    |0   |
|JJ-aSuM4pCFPdkfoZ34q0Q|Chris  |0     |0    |0   |
|uUzsFQn_6cXDh6rPNGbIFA|Tiffy  |0     |0    |0   |
|mBneaEEH5EMyxaVyqS-72A|Mark   |0     |0    |0   |
|W5mJGs-dcDWRGEhAzUYtoA|Evelyn |0     |0    |0   |
|4E8--zUZO1Rr1IBK4_83fg|Lisa   |4     |0    |0   |
|Ob-2oGBQ7rwwYwUvhmnf7g|B      |0     |0    |0   |
|JaTVvKsBl0bHHJEpESn4pQ|Peter  |0     |0    |0   |
|Ykj0DVsz0c6rX9ghjd0hDg|Colleen|0     |0    |0   |
|kmyEPfKnHQJdTceCdoyMQg|A      |0     |0    |0   |
|H54pA7YHfjl8IjhHAfdXJA|Chad   |0     |0    |0   |
|WRae-wZkpRoxMrgJdqwyxg|Mike   |0     |0    |0   |
|Mmv5fPxbF8XEMN4EPT_Khg|Chris  |0     |0    |0   |
|LdqGHXsNQowMrvgTNburJA|Susan  |3     |1    |0   |
|TsgBsn19Wjwpyo81gF9_8Q|Cathy  |0     |0    |0   |
|V--GjQPlTpeWbcB2cS06Gw|Cody   |2     |0    |1   |
|a_gKYQ5YMg3

# Adding Conditions with a WHERE clause

The SELECT and FROM as used above will return every row in the temporary view.  If we want only certain rows, 
we need to add a WHERE clause to the query.  This is similar to the `filter` method for DataFrames.  In the following cell, run the same query, but this time, get only
those users who have written over 50 reviews:

In [115]:
print "SQL WHERE CLAUSE: "

spark.sql("""
SELECT user_id as Users, name as Name, yelping_since as Yelper_Since, review_count as Reviews
FROM user
WHERE review_count > 50
""").show(truncate=False)

SQL WHERE CLAUSE: 
+----------------------+---------+------------+-------+
|Users                 |Name     |Yelper_Since|Reviews|
+----------------------+---------+------------+-------+
|bZkZgll3Fii18x3WRtB5Lg|Dale     |2010-01-06  |62     |
|aw973Pm1nrTbRjP4zY9B9g|Kenny    |2008-09-23  |762    |
|B46q0uJGiuzTtcYXknkgcQ|izagui   |2008-01-14  |83     |
|oMT1lSrACglwgFPDBYEIuA|mmaaxxmül|2011-02-06  |54     |
|X-yLZasrQYb4hWTdP-fOGQ|K And A  |2013-10-14  |54     |
|m-Pnm53eK4NFW6keiL7yjQ|Donny    |2010-07-17  |149    |
|ioT4UHRwWlbAidIJHfnvOg|Randy    |2006-11-06  |154    |
|wm97KC6G0resSDXTmNIMKw|Dwain    |2012-06-08  |1262   |
|i5jSTSpXJtvM-ExWRttglw|Evelina  |2014-03-17  |85     |
|dQ5yKW2B1M-EvS1BqKfxWg|David    |2012-10-18  |109    |
|snDjs1hdh7JOWv4jjbXPDw|Michael  |2008-11-19  |447    |
|NSszc7yDLIlt1tzINtGNRg|Ryan     |2011-06-22  |200    |
|hgLpWCiE3tWvBYfP0q2wLg|jim      |2010-01-14  |54     |
|r-00EZGnVEHCRH7IkVDN1Q|bef      |2008-07-25  |59     |
|MiDcQ-bgIg4B91reFV4Qaw|Carly

In [None]:
%%html
<!-- This somewhat ugly cell in the notebook is to align the table in the next cell -->
<style>
table {float:left}
</style>

### Comparisons in the WHERE Clause
Most of the comparisons you will be doing you learned in grade school.  
Following is a table of the possible comparisons you can do.  


| Operator | What it Does | Example |
| --- | --- | --- |
|  <       | less than | review_count < 100 |
| <=       | less than or equal to | review_count <= 100 |
| >        | greater than | review_count > 100 |
| >=        | greater than or equal to | review_count >= 100 |
| =        | equal | review_count = 100 |
| <>        | not equal to | review_count <> 100 |
| +        | add two fields | funny + cool > 100 | 
| -        | subtract two fields | funny - cool > 100 |
| *        | multiply two fields | funny * 2 > useful |
|  /       | divide one field by another<br/> (The divisor on the bottom cannot be zero) |funny / 2 > useful |
| %        | remainder of dividing one number by another | funny % 10 > 5<br/><br/> Means that if funny is divided by 10, <br/>the remainder is greater than 5, so a <br/>funny value of 16 matches (16 % 10 = 6) |
| BETWEEN x AND y | matches values in a range <br/>(including the end points)  | funny BETWEEN 5 AND 20 |
| NOT      | finds rows that don't match the condition | funny NOT BETWEEN 5 AND 20 <br/>matches rows where funny is less than 5 or more than 20 |
| IS NULL  | matches rows where the specified field is null | funny IS NULL |
| IS NOT NULL | matches rows where the specified field is not null <br/>(an example of combining `NOT` and `IS NULL`) | funny IS NOT NULL |
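
A few of the operators in the table can be tried on a tiny data set without a Spark cluster. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, on the assumption that these particular operators behave the same way in both. The table contents and the `names_where` helper are invented for illustration, and the remainder is written with `%`.

```python
import sqlite3

# Hypothetical miniature of the user view; the names and numbers are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT, funny INTEGER, cool INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("Ann", 16, 3), ("Bob", 5, 40), ("Cy", 20, 0), ("Dee", None, 7), ("Eve", 31, 90)],
)

def names_where(condition):
    """Return the names of the rows matching a WHERE condition, sorted by name."""
    sql = "SELECT name FROM user WHERE {} ORDER BY name".format(condition)
    return [row[0] for row in conn.execute(sql)]

between = names_where("funny BETWEEN 5 AND 20")          # end points 5 and 20 are included
not_between = names_where("funny NOT BETWEEN 5 AND 20")  # a NULL funny value never matches
remainder = names_where("funny % 10 > 5")                # 16 % 10 = 6, so Ann matches
is_null = names_where("funny IS NULL")
added = names_where("funny + cool > 100")                # arithmetic on two fields

print("BETWEEN 5 AND 20     -> {}".format(between))
print("NOT BETWEEN 5 AND 20 -> {}".format(not_between))
print("funny % 10 > 5       -> {}".format(remainder))
print("funny IS NULL        -> {}".format(is_null))
print("funny + cool > 100   -> {}".format(added))
```

Note that Dee, whose `funny` value is null, is returned only by the `IS NULL` condition: a comparison against null is neither true nor false, so the row is dropped by both `BETWEEN` and `NOT BETWEEN`.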

### Adding Multiple Conditions in the WHERE Clause
You can have multiple conditions in the WHERE clause. If you want to combine multiple conditions, use `AND` and `OR` to combine 
the conditions and use parentheses to set the order of precedence for how the ANDs and ORs are considered (your junior high
school math teacher said that would be useful some day).  Without parentheses, `AND` is evaluated before `OR`.
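
To see why the parentheses matter, here is a small sketch that runs the same three conditions with and without them. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, assuming the two agree on AND/OR precedence (both follow the usual SQL rule). The four users and the `names` helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (name TEXT, review_count INTEGER, funny INTEGER, cool INTEGER)"
)
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?, ?)",
    [
        ("Ann", 80, 15, 2),   # many reviews, funny in range
        ("Bob", 10, 1, 50),   # few reviews, lots of cool
        ("Cy", 200, 0, 30),   # many reviews, lots of cool
        ("Dee", 5, 12, 0),    # few reviews, funny in range
    ],
)

def names(where):
    """Return the sorted names of the rows that satisfy the WHERE clause."""
    sql = "SELECT name FROM user WHERE " + where + " ORDER BY name"
    return [row[0] for row in conn.execute(sql)]

# AND binds more tightly than OR, so this is read as
# (review_count > 50 AND funny BETWEEN 10 AND 20) OR cool >= 20
no_parens = names("review_count > 50 AND funny BETWEEN 10 AND 20 OR cool >= 20")

# The parentheses make review_count > 50 apply to both alternatives.
with_parens = names("review_count > 50 AND (funny BETWEEN 10 AND 20 OR cool >= 20)")

print("without parentheses: {}".format(no_parens))
print("with parentheses:    {}".format(with_parens))
```

Bob has only 10 reviews, yet the version without parentheses returns him because his cool count alone satisfies the `OR` branch.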

In the next cell, create a query that gets those users who have written over 50 reviews and have voted cool more than 20 times:

In [119]:
spark.sql("""
SELECT user_id as Users, name as Name, yelping_since as Yelper_Since, review_count as Reviews, 
        compliment_cool as Cool
FROM user
WHERE review_count > 50 AND compliment_cool > 20
""").show(truncate=False)

+----------------------+---------+------------+-------+----+
|Users                 |Name     |Yelper_Since|Reviews|Cool|
+----------------------+---------+------------+-------+----+
|aw973Pm1nrTbRjP4zY9B9g|Kenny    |2008-09-23  |762    |47  |
|wm97KC6G0resSDXTmNIMKw|Dwain    |2012-06-08  |1262   |94  |
|snDjs1hdh7JOWv4jjbXPDw|Michael  |2008-11-19  |447    |22  |
|MiDcQ-bgIg4B91reFV4Qaw|Carly    |2012-08-22  |110    |172 |
|bBRPy8zUvNc0NGbGmkjrZg|Jan      |2009-03-30  |462    |764 |
|37Hc8hr3cw0iHLoPzLK6Ow|Christine|2008-03-03  |496    |310 |
|oH9K7eCuNsYr6MmlM2ZjUg|Buo      |2007-11-10  |902    |42  |
|z9MozWK9f7C8p3Gj0uOiHw|Marna    |2009-10-17  |246    |105 |
|bzMzZE3OCqHhZyXH5JRaWw|Lucy     |2008-09-29  |851    |169 |
|qSh-q8M-rL4PRVukXsDwWg|Ellen    |2007-12-28  |228    |30  |
|lmJy4OwP_TyHIg8a8Q0RsA|Alan     |2013-06-14  |646    |73  |
|keLUgL_4y60BkppiAsIk8Q|Hazel    |2014-06-21  |229    |54  |
|et_GDGFfG2BFVkLzRK2mTQ|Linda    |2008-07-31  |269    |42  |
|KBVL9aPlcLVwqyFQ__EeIA|

### Ordering multiple conditions

In the following cell, modify the query you have been working on so that you get those reviewers who have written over 50 reviews and have either:
* Cast between 10 and 20 funny votes
* Cast 20 or more cool votes

In [120]:
print "Multiple Conditions: "

spark.sql("""
SELECT user_id as Users, name as Name, yelping_since as Yelper_Since, review_count as Reviews, 
        compliment_funny as Funny, compliment_cool as Cool
FROM user
WHERE review_count > 50 
AND (compliment_funny BETWEEN 10 AND 20 
     OR compliment_cool >= 20)
""").show(truncate=False)

Multiple Conditions: 
+----------------------+---------+------------+-------+-----+----+
|Users                 |Name     |Yelper_Since|Reviews|Funny|Cool|
+----------------------+---------+------------+-------+-----+----+
|aw973Pm1nrTbRjP4zY9B9g|Kenny    |2008-09-23  |762    |47   |47  |
|wm97KC6G0resSDXTmNIMKw|Dwain    |2012-06-08  |1262   |94   |94  |
|i5jSTSpXJtvM-ExWRttglw|Evelina  |2014-03-17  |85     |13   |13  |
|snDjs1hdh7JOWv4jjbXPDw|Michael  |2008-11-19  |447    |22   |22  |
|MiDcQ-bgIg4B91reFV4Qaw|Carly    |2012-08-22  |110    |172  |172 |
|bBRPy8zUvNc0NGbGmkjrZg|Jan      |2009-03-30  |462    |764  |764 |
|37Hc8hr3cw0iHLoPzLK6Ow|Christine|2008-03-03  |496    |310  |310 |
|oH9K7eCuNsYr6MmlM2ZjUg|Buo      |2007-11-10  |902    |42   |42  |
|z9MozWK9f7C8p3Gj0uOiHw|Marna    |2009-10-17  |246    |105  |105 |
|bzMzZE3OCqHhZyXH5JRaWw|Lucy     |2008-09-29  |851    |169  |169 |
|qSh-q8M-rL4PRVukXsDwWg|Ellen    |2007-12-28  |228    |30   |30  |
|cT5d9cgC3It82XTlOUCH5w|Alicia   |2011-1

# Aggregation, Grouping and the GROUP BY Clause

The next clause to add to our query is aggregation.  So far we have been selecting individual rows and the results have not involved any interaction between rows in the data.

What if we wanted the total number of reviews that have been written by the users in our dataset and the total number of cool votes they have cast?  That's what the following query does (run it).

Why are we aliasing the calculated sums? `SUM(review_count) AS reviews`  What happens if you do not add `AS reviews`? Try it out.

In the following query we are using SUM, but there are other aggregation functions you can use.  Some of the more common aggregation functions are listed in the following table.  The "what's returned" column assumes you are not doing any grouping.

Run the query

| Function | Example | What's Returned |
| ----- | ---- | --- |
| SUM | SUM(cool) | total of the cool votes cast |
| AVG | AVG(review_count) | the average number of reviews <br/>written across all of the users |
|STDDEV | STDDEV(review_count) | the standard deviation of the <br/>review_count across all of the users |
| MIN | MIN(review_count) | the lowest review count <br/>of any user in the data |
| MAX | MAX(review_count) | the most reviews any user <br/>in the data has written |
| COUNT | COUNT(user_id) | counts the number of users in the data<br/>since we can get a count of the DataFrame, <br/>this is more useful when grouping |
| DISTINCT | COUNT (DISTINCT name) | DISTINCT is not a grouping function, but is <br/>often used with grouping - it excludes<br/>duplicates from the result. |
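
As a rough illustration of the functions in the table, the sketch below computes several of them in one query over four invented users. It uses Python's built-in `sqlite3` module as a stand-in for Spark SQL; `STDDEV` is left out because SQLite does not provide it. The table contents and column aliases are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user (user_id TEXT, name TEXT, review_count INTEGER, cool INTEGER)"
)
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?, ?)",
    [
        ("u1", "Chris", 10, 4),
        ("u2", "Chris", 2, 0),   # same name as u1, different user
        ("u3", "Lisa", 30, 6),
        ("u4", "Mark", 6, 2),
    ],
)

cursor = conn.execute("""
SELECT SUM(cool)            AS cool_votes,
       AVG(review_count)    AS avg_reviews,
       MIN(review_count)    AS fewest,
       MAX(review_count)    AS most,
       COUNT(user_id)       AS users,
       COUNT(DISTINCT name) AS distinct_names
FROM user
""")

# The aliases become the column names of the single result row.
columns = [description[0] for description in cursor.description]
totals = cursor.fetchone()

for column, value in zip(columns, totals):
    print("{:>14}: {}".format(column, value))
```

With no GROUP BY, every aggregate collapses the whole table into one row, and `COUNT(DISTINCT name)` is 3 rather than 4 because the two users named Chris are counted once.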

In [121]:
spark.sql("""
SELECT SUM(review_count) AS reviews, SUM(cool) AS cool_votes
FROM user
""").show(truncate=False)

+--------+----------+
|reviews |cool_votes|
+--------+----------+
|30655691|25996163  |
+--------+----------+



### Adding a GROUP BY clause

The above query returns a single row with totals for all of our users, but what if we wanted to group our users?

For example, we have the average star rating for each user (across all of their reviews), so what if we wanted to group them into 5 groups 1 - 5 based on rounding their average stars and then calculate:
* The number of users in each group
* The average number and standard deviation of reviews written by the users in each group
* The average number of useful votes cast by each user in each group

We could write the following query:

In [122]:
spark.sql("""
SELECT ROUND(average_stars,0) AS stars,
       COUNT(user_id) AS users,
       AVG(review_count) AS reviews, 
       STDDEV(review_count) AS rev_stddev,
       AVG(useful) AS useful_votes
FROM user
GROUP BY ROUND(average_stars,0)
ORDER BY stars DESC
""").show(truncate=False)

+-----+------+------------------+------------------+------------------+
|stars|users |reviews           |rev_stddev        |useful_votes      |
+-----+------+------------------+------------------+------------------+
|5.0  |380756|5.3597448234564915|13.525511145777797|3.3632299950624547|
|4.0  |499538|42.75221905040257 |110.99116669141631|62.74042615376608 |
|3.0  |265617|24.56398122108148 |79.52535220373554 |21.700990523949898|
|2.0  |86226 |6.6548488854869765|12.35533766805391 |5.141662607566164 |
|1.0  |93964 |1.70433357456047  |2.5777271211025155|0.9024945723894258|
+-----+------+------------------+------------------+------------------+
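
The grouping can be followed by hand on a data set small enough to check. This is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for Spark SQL, with six invented users; it keeps the same `ROUND(average_stars, 0)` grouping expression but omits `STDDEV`, which SQLite lacks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id TEXT, average_stars REAL, review_count INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [
        ("u1", 4.67, 8),   # rounds to 5
        ("u2", 3.7, 10),   # rounds to 4
        ("u3", 2.0, 1),    # rounds to 2
        ("u4", 4.8, 4),    # rounds to 5
        ("u5", 3.45, 11),  # rounds to 3
        ("u6", 4.2, 6),    # rounds to 4
    ],
)

# One output row per distinct value of the grouping expression.
groups = conn.execute("""
SELECT ROUND(average_stars, 0) AS stars,
       COUNT(user_id)          AS users,
       AVG(review_count)       AS reviews
FROM user
GROUP BY ROUND(average_stars, 0)
ORDER BY stars DESC
""").fetchall()

for stars, users, reviews in groups:
    print("stars={} users={} avg_reviews={}".format(stars, users, reviews))
```

Users u1 and u4 both land in the 5-star group, so that row reports 2 users and the average of their review counts, (8 + 4) / 2 = 6.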



# Handy Math Functions 

If you have used Excel, the above ROUND function, which is rounding the stars to zero decimal places, should look familiar.  The following table lists some other math functions you are likely to find handy in your data wrangling (there are many more).

| Function |  Example | What it Does |
| -------- | -------- | ------------ |
| ROUND(x,y) | ROUND(average_stars,0)| Rounds the field named as the first parameter <br/>to the number of decimal places in <br/>the second parameter | 
| CEILING(x) | CEILING(average_stars)| Rounds up to the next integer value, <br/>so if the average_stars <br/>for a user was 3.4 or 3.8, the <br/>example shown would return 4 for <br/>that user |
| FLOOR(x) | FLOOR(average_stars)| Rounds down to the nearest lower integer value, <br/>so if the average_stars <br/>for a user was 3.4 or 3.8, the <br/>example shown would return 3 for <br/>that user |
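
The same three operations can be checked in plain Python. The sketch below is only an analogy, not Spark code: `math.ceil` and `math.floor` play the roles of CEILING and FLOOR, and the hypothetical helper `sql_round` rounds ties away from zero, which is assumed here to match the half-up behavior of Spark's ROUND.

```python
import math
from decimal import Decimal, ROUND_HALF_UP

def sql_round(value, places=0):
    """Round to `places` decimal places, with ties going away from zero."""
    step = Decimal(1).scaleb(-places)  # 1, 0.1, 0.01, ... depending on places
    return float(Decimal(str(value)).quantize(step, rounding=ROUND_HALF_UP))

for stars in (3.4, 3.8):
    print("average_stars={}  ROUND={}  CEILING={}  FLOOR={}".format(
        stars, sql_round(stars), math.ceil(stars), math.floor(stars)))

# The second parameter keeps that many decimal places.
print("sql_round(4.67, 1) = {}".format(sql_round(4.67, 1)))
```

Both 3.4 and 3.8 have a ceiling of 4 and a floor of 3, while rounding sends 3.4 down and 3.8 up.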

### Sorting the Results

To see the results of the above query in sorted order (descending order based on the `stars` field) add the following line after the GROUP BY clause and rerun the query:

`
ORDER BY stars DESC
`

Note that in the ORDER BY clause we can use the alias for the `stars` field instead of the formula.

### Filtering the Aggregated Results

Earlier we used the WHERE clause to filter the rows included in your query's result.  Once you have aggregated, you can filter again using a HAVING clause.  Copy your query from above to the following blank cell and then add the following line between your GROUP BY and ORDER BY clauses:

`
HAVING users >= 100000
`

This tells Spark to filter the result so that only the groups that have at least 100,000 users are included in the result.

In [123]:
spark.sql("""
SELECT ROUND(average_stars,0) AS stars,
       COUNT(user_id) AS users,
       AVG(review_count) AS reviews, 
       STDDEV(review_count) AS rev_stddev,
       AVG(useful) AS useful_votes
FROM user
GROUP BY ROUND(average_stars,0)
HAVING users >= 100000
ORDER BY stars DESC
""").show(truncate=False)

+-----+------+------------------+------------------+------------------+
|stars|users |reviews           |rev_stddev        |useful_votes      |
+-----+------+------------------+------------------+------------------+
|5.0  |380756|5.3597448234564915|13.525511145777797|3.3632299950624547|
|4.0  |499538|42.75221905040257 |110.99116669141631|62.74042615376608 |
|3.0  |265617|24.56398122108148 |79.52535220373554 |21.700990523949898|
+-----+------+------------------+------------------+------------------+
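
The difference between WHERE and HAVING is easiest to see when both appear in one query: WHERE removes rows before they are grouped, and HAVING removes groups after they are aggregated. The following sketch shows this on six invented users, using Python's built-in `sqlite3` module as a stand-in for Spark SQL. It repeats `COUNT(user_id)` in the HAVING clause instead of the `users` alias, which is the more portable spelling; the thresholds are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id TEXT, average_stars REAL, review_count INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [("u1", 4.67, 8), ("u2", 3.7, 10), ("u3", 2.0, 1),
     ("u4", 4.8, 4), ("u5", 3.45, 11), ("u6", 4.2, 6)],
)

# HAVING alone: build every group, then keep groups with at least 2 users.
having_only = conn.execute("""
SELECT ROUND(average_stars, 0) AS stars, COUNT(user_id) AS users
FROM user
GROUP BY ROUND(average_stars, 0)
HAVING COUNT(user_id) >= 2
ORDER BY stars DESC
""").fetchall()

# WHERE first drops users with fewer than 5 reviews (u3 and u4),
# so the 5-star group shrinks to one user and no longer passes HAVING.
where_then_having = conn.execute("""
SELECT ROUND(average_stars, 0) AS stars, COUNT(user_id) AS users
FROM user
WHERE review_count >= 5
GROUP BY ROUND(average_stars, 0)
HAVING COUNT(user_id) >= 2
ORDER BY stars DESC
""").fetchall()

print("HAVING only:       {}".format(having_only))
print("WHERE then HAVING: {}".format(where_then_having))
```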



# JOIN to Bring Together Multiple Temporary Views

The JOIN matches rows between two (or more) views based on whether the field(s) specified in the join match.

We will explore three types of joins you might use:

|Type of Join | What the Result Includes |
| ----------- | ------------------------ |
| INNER JOIN | Includes rows from both tables only if the matching field value is in both views  |
| LEFT OUTER JOIN | Includes all of the rows from the view on the left in the JOIN, with nulls for the right-hand fields where there is no match <br/>( e.g., viewA LEFT OUTER JOIN viewB would have viewA as the view on the left) |
| FULL OUTER JOIN | Includes each row in each table, regardless of whether it's in the other table |
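
The three join types can be compared side by side on two tiny tables. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` module as a stand-in for Spark SQL, the rows are invented, and because older SQLite versions lack FULL OUTER JOIN, the full outer result is assembled from a left join plus the unmatched rows of the other table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id TEXT, name TEXT)")
conn.execute("CREATE TABLE gender (name TEXT, gender TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?)",
                 [("u1", "Johnny"), ("u2", "Tiffy"), ("u3", "B")])
conn.executemany("INSERT INTO gender VALUES (?, ?)",
                 [("johnny", "2"), ("tiffy", "1"), ("acheron", "2")])

ON_NAMES = "ON LOWER(U.name) = LOWER(G.name)"

# INNER JOIN: only names present in both tables.
inner_rows = conn.execute(
    "SELECT U.user_id, U.name, G.gender FROM user AS U INNER JOIN gender AS G "
    + ON_NAMES + " ORDER BY U.user_id").fetchall()

# LEFT OUTER JOIN: every user, with None (null) where no gender row matches.
left_rows = conn.execute(
    "SELECT U.user_id, U.name, G.gender FROM user AS U LEFT OUTER JOIN gender AS G "
    + ON_NAMES + " ORDER BY U.user_id").fetchall()

# Gender rows with no matching user: the extra rows a FULL OUTER JOIN would add.
unmatched_gender = conn.execute(
    "SELECT U.user_id, U.name, G.gender FROM gender AS G LEFT OUTER JOIN user AS U "
    + ON_NAMES + " WHERE U.user_id IS NULL").fetchall()

full_rows = left_rows + unmatched_gender

print("inner: {}".format(inner_rows))
print("left:  {}".format(left_rows))
print("full:  {}".format(full_rows))
```

User B survives only in the left and full results, and the name "acheron" contributes a row only to the full result, with null user fields.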

In the following cell we are doing an inner join of the user and gender views based on the name fields.

In [124]:
spark.sql("""
SELECT U.user_id, U.name, U.review_count, U.average_stars,
       G.gender, G.gender_ratio
FROM user AS U INNER JOIN gender AS G
ON LOWER(U.name) = LOWER(G.name)
""").show(truncate=False)

+----------------------+-------+------------+-------------+------+------------------+
|user_id               |name   |review_count|average_stars|gender|gender_ratio      |
+----------------------+-------+------------+-------------+------+------------------+
|oMy_rEb0UBEmMlu-zcxnoQ|Johnny |8           |4.67         |2     |0.988185431137896 |
|JJ-aSuM4pCFPdkfoZ34q0Q|Chris  |10          |3.7          |2     |0.8625031523579637|
|uUzsFQn_6cXDh6rPNGbIFA|Tiffy  |1           |2.0          |1     |1.0               |
|mBneaEEH5EMyxaVyqS-72A|Mark   |6           |4.67         |2     |0.9966800205261983|
|W5mJGs-dcDWRGEhAzUYtoA|Evelyn |3           |4.67         |1     |0.9966969953487197|
|4E8--zUZO1Rr1IBK4_83fg|Lisa   |11          |3.45         |1     |0.9971182300888615|
|JaTVvKsBl0bHHJEpESn4pQ|Peter  |2           |5.0          |2     |0.9966299930617504|
|Ykj0DVsz0c6rX9ghjd0hDg|Colleen|1           |1.0          |1     |0.9979678927047348|
|H54pA7YHfjl8IjhHAfdXJA|Chad   |3           |5.0      

### LEFT OUTER JOIN

Copy the above query to the cell below and modify it to do a LEFT OUTER JOIN

In [125]:
spark.sql("""
SELECT U.user_id, U.name, U.review_count, U.average_stars,
       G.gender, G.gender_ratio
FROM user AS U LEFT OUTER JOIN gender AS G
ON LOWER(U.name) = LOWER(G.name)
""").show(truncate=False)

+----------------------+-------+------------+-------------+------+------------------+
|user_id               |name   |review_count|average_stars|gender|gender_ratio      |
+----------------------+-------+------------+-------------+------+------------------+
|oMy_rEb0UBEmMlu-zcxnoQ|Johnny |8           |4.67         |2     |0.988185431137896 |
|JJ-aSuM4pCFPdkfoZ34q0Q|Chris  |10          |3.7          |2     |0.8625031523579637|
|uUzsFQn_6cXDh6rPNGbIFA|Tiffy  |1           |2.0          |1     |1.0               |
|mBneaEEH5EMyxaVyqS-72A|Mark   |6           |4.67         |2     |0.9966800205261983|
|W5mJGs-dcDWRGEhAzUYtoA|Evelyn |3           |4.67         |1     |0.9966969953487197|
|4E8--zUZO1Rr1IBK4_83fg|Lisa   |11          |3.45         |1     |0.9971182300888615|
|Ob-2oGBQ7rwwYwUvhmnf7g|B      |9           |4.78         |null  |null              |
|JaTVvKsBl0bHHJEpESn4pQ|Peter  |2           |5.0          |2     |0.9966299930617504|
|Ykj0DVsz0c6rX9ghjd0hDg|Colleen|1           |1.0      

### FULL OUTER JOIN

Copy the above query to the cell below and modify it to do a FULL OUTER JOIN

In [126]:
spark.sql("""
SELECT U.user_id, U.name, U.review_count, U.average_stars,
       G.gender, G.gender_ratio
FROM user AS U FULL OUTER JOIN gender AS G
ON LOWER(U.name) = LOWER(G.name)
""").show(truncate=False)

+----------------------+---------------+------------+-------------+------+------------+
|user_id               |name           |review_count|average_stars|gender|gender_ratio|
+----------------------+---------------+------------+-------------+------+------------+
|8yuMP766uRPe6sEcxS1AXQ|'Nikki         |5           |4.14         |null  |null        |
|MS_IRLwpy-TUGNyUIDRq-Q|1Southerngyrl76|1           |1.0          |null  |null        |
|k7aF28JCVxi14O4qO_2fEw|A-dogg         |46          |4.0          |null  |null        |
|null                  |null           |null        |null         |2     |1.0         |
|WwsMmD9R4Oz0ZqmsY4VxTQ|Advice4u       |4           |4.0          |null  |null        |
|Zu7gGW6dlGIf3iDG929_oQ|Adwinnie       |40          |3.85         |null  |null        |
|qMubZnWM_lLRXqYwEbbG9g|Agustinus      |20          |4.14         |null  |null        |
|null                  |null           |null        |null         |1     |1.0         |
|null                  |null    

### Using the IF function to Check for Null

Something weird happens with the FULL OUTER JOIN.  For names that are in the gender data but not in the user data, the gender view has the name, but that name is not in our result because our SELECT clause
includes `U.name`.  What we really want is to get the name from the user view (with the alias "U") when there is a user with that name, but if not, then we want the name from the gender view.

The solution is to use the IF function in SQL instead of `U.name`.  The format of the IF function is as follows:

`
IF(<some condition>, <result if true>, <result if false>)
`    
 
Copy the FULL OUTER JOIN query to the cell below and use the IF function to fix this problem with your query.  If you are not sure how, take a look back at the table of operators listed earlier in this notebook.

In [127]:
print "IF function to check for NULL: "

spark.sql("""

SELECT U.user_id, IF (U.name IS NULL, G.name, U.name) as Name, U.review_count, U.average_stars,
       G.gender, G.gender_ratio
FROM user AS U FULL OUTER JOIN gender AS G
ON LOWER(U.name) = LOWER(G.name)
""").show(truncate=False)

IF function to check for NULL: 
+----------------------+---------------+------------+-------------+------+------------+
|user_id               |Name           |review_count|average_stars|gender|gender_ratio|
+----------------------+---------------+------------+-------------+------+------------+
|8yuMP766uRPe6sEcxS1AXQ|'Nikki         |5           |4.14         |null  |null        |
|MS_IRLwpy-TUGNyUIDRq-Q|1Southerngyrl76|1           |1.0          |null  |null        |
|k7aF28JCVxi14O4qO_2fEw|A-dogg         |46          |4.0          |null  |null        |
|null                  |acheron        |null        |null         |2     |1.0         |
|WwsMmD9R4Oz0ZqmsY4VxTQ|Advice4u       |4           |4.0          |null  |null        |
|Zu7gGW6dlGIf3iDG929_oQ|Adwinnie       |40          |3.85         |null  |null        |
|qMubZnWM_lLRXqYwEbbG9g|Agustinus      |20          |4.14         |null  |null        |
|null                  |ahaana         |null        |null         |1     |1.0         |
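
The same "use the other name when this one is null" logic can also be written with `CASE WHEN` or `COALESCE`, which are standard SQL and, as far as I know, are accepted by Spark SQL as well. The sketch below shows both on invented rows, using Python's built-in `sqlite3` module as a stand-in (SQLite has no `IF` function). It joins from `gender` to `user` with a LEFT OUTER JOIN so that some user fields come back null.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_id TEXT, name TEXT)")
conn.execute("CREATE TABLE gender (name TEXT, gender TEXT)")
conn.executemany("INSERT INTO user VALUES (?, ?)",
                 [("u1", "Johnny"), ("u2", "Tiffy")])
conn.executemany("INSERT INTO gender VALUES (?, ?)",
                 [("johnny", "2"), ("tiffy", "1"), ("acheron", "2")])

rows = conn.execute("""
SELECT U.user_id,
       CASE WHEN U.name IS NULL THEN G.name ELSE U.name END AS name_case,
       COALESCE(U.name, G.name)                             AS name_coalesce,
       G.gender
FROM gender AS G LEFT OUTER JOIN user AS U
ON LOWER(U.name) = LOWER(G.name)
ORDER BY G.name
""").fetchall()

for row in rows:
    print(row)
```

For "acheron" there is no user, so both expressions fall back to the name from the gender table; for the matched rows they keep the user's own spelling.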


# You're Done With This Exercise

Be sure to do the following:
* Save your notebook and then save a version of your notebook (from the File menu)
* Create a sharable link (without sensitive cells)
* Upload your link as the deliverable for the exercise
* Test your link - is it showing what you want to submit?
