## Scholastic Data Challenge Introduction
This project is based on an Analytics Challenge hosted by the Institute for Business and Information Technology (IBIT) at Temple University in 2020. The dataset, downloaded from the IBIT website, is real-world data provided by Scholastic for the competition.

### What will the children’s book market look like in the future?

Scholastic is a major international publishing, educational, and media company with a focus on books and educational materials designed to support children’s literacy and cultivate a passion for reading and knowledge that will continue throughout life. Scholastic’s mission is driven by its credo, which articulates this goal for the company.

Scholastic has many different channels through which it distributes the books it publishes. At times these channels work collaboratively to reach new customers in different ways, but at times there can be significant overlap between these channels. Through analyzing the interactions of these channels Scholastic seeks to better understand the children’s book market, vis-a-vis demographics, geography, genre, and price.

In the spirit of Scholastic’s mission of cultivating learning, the provided data is presented in a realistic manner, as a small snapshot of Scholastic’s sales between two separate distribution channels across the nation. In an effort to make this a realistic scenario, the data presented is not fully cleaned and there are many additional interesting variables which it is the job of the analyst to identify and evaluate.

Start your analysis by answering question 1 below, and then answer at least one of the remaining questions.

1. What trends do you see in the data among demographics, genre/theme, and price?
2. What does the data suggest about Scholastic’s distribution channels, and how would you recommend structuring a distribution strategy?
3. What other publicly available data can you append (Census, state, region, etc.) to provide further insight?
4. Formulate a unified strategy for marketing between the two channels. Where are there areas of significant overlap between the channels, and what strategy do you suggest to prevent unintentional competition between channels?

Below are all the features in this dataset:

- title: Title of product sold
- TITLE_CODE: Unique ID for titles
- CHANNEL: Masked description of the channel through which the product was distributed to the customer
- PROD_TYP: Indicator if the product is a paperback or hardback
- SERIES: Y/N indicator if the product is part of a series
- CH1_GENRE: Genre listing for product from Channel 1 database
- CH1_THEME: Theme listing for product from Channel 1 database
- CH2_CATEGORY: Category listing for product from Channel 2 database
- CH2_SUBCATEGORY: Subcategory listing for product from Channel 2 database
- LEXILE_11_DESC: Lexile measures for product. Note this field is not always complete for every Scholastic product. For more information on Lexile codes, see links below:
 - https://lexile.com/educators/measuring-growth-with-lexile/lexile-measures-grade-equivalents/
 - https://lexile.com/educators/find-books-at-the-right-level/about-lexile-codes/
- total_units: Number of products sold 
- UNIT_PRICE: Unit price of product sold
- SCHOOL_TYPE: Indicator if the school where the product was sold was public or not.
- REGION: Region of the United States where product was distributed - NORTHEAST, MIDWEST, SOUTH, WEST, or OTHER
- STATE: US state of sale
- COUNTY: County of sale
- EDU_NO_HS: % of population with no HS degree, by zip code
- EDU_HS_SOME_COLLEGE: % of population with some college, by zip code
- EDU_BACHELOR_DEG: % of population with bachelor’s degree, by zip code
- EDU_GRADUATE_DEG: % of population with graduate degree, by zip code
- HHI_BAND: Bands of household income for zip code, in $10,000 bands
- ZIP_CODE: Zip code of sale

In this exercise, we will perform an exploratory analysis of the dataset. The following activities will be performed:
- read the dataset into Spark (I had to change the delimiter from comma to pipe)
- run various exploratory analyses to better understand the data
- perform ETL on the data
- run various queries to address the questions in this challenge

## Load the dataset and create a temp view

In [0]:
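# Read the pipe-delimited file; take column names from the header row and infer column types
scholastic = spark.read.option("delimiter", "|").csv("/FileStore/tables/scholastic/sdata.csv", header=True, inferSchema=True)

# Register the DataFrame as a temp view so it can be queried with Spark SQL
scholastic.createOrReplaceTempView("BooksTable")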


In [0]:
display(scholastic)

title,TITLE_CODE,CHANNEL,PROD_TYP,SERIES,CH1_GENRE,CH1_THEME,CH2_CATEGORY,CH2_SUBCATEGORY,LEXILE_11_DESC,total_units,UNIT_PRICE,SCHOOL_TYPE,REGION,STATE,COUNTY,EDU_NO_HS,EDU_HS_SOME_COLLEGE,EDU_BACHELOR_DEG,EDU_GRADUATE_DEG,HHI_BAND,ZIP_CODE
Dog Man: Lord of the Fleas,71797,CHANNEL 2,HARDBACK,Y,"['Humor & Funny Stories', 'Action & Adventure']","['Reluctant Reader Appeal', 'Superheroes']",GRAPHIC NOVELS,,,2,9.99,OTHER,OTHER,,,,,,,,.
Grumpy Pants,72780,CHANNEL 2,PAPERBACK,N,,"['Emotions & Feelings', 'Bedtime & Dreams', 'Penguins']",,,AD380L,1,4.95,OTHER,OTHER,,,,,,,,.
Walk and See: 123,77338,CHANNEL 2,PAPERBACK,N,,"['Counting, Numbers & Place Value', 'Fall', 'Seasons']",,,,2,4.95,OTHER,OTHER,,,,,,,,.
Dog Man: Lord of the Fleas,71797,CHANNEL 2,HARDBACK,Y,"['Humor & Funny Stories', 'Action & Adventure']","['Reluctant Reader Appeal', 'Superheroes']",GRAPHIC NOVELS,,,1,9.99,OTHER,OTHER,,,,,,,,.
Pete the Cat and the Missing Cupcakes,75211,CHANNEL 2,PAPERBACK,N,,"['Kindness', 'Cooking & Food', 'Friendship']",PICTURE,PAPERBACK BOOK,AD440L,1,6.95,OTHER,OTHER,,,,,,,,.
When I Grow Up,77501,CHANNEL 2,PAPERBACK,N,"['Poetry, Songs & Verse']",['Growing Up'],,,,1,5.95,OTHER,OTHER,,,,,,,,.
"Grow Up, David!",72771,CHANNEL 2,HARDBACK,N,['Humor & Funny Stories'],"['Brothers & Sisters', 'Family Life', 'Behavior & Manners']",PICTURE,HC/POB BOOK,,1,17.99,OTHER,OTHER,,,,,,,,.
Amelia Bedelia on the Move,70281,CHANNEL 2,PAPERBACK,N,['Humor & Funny Stories'],"['Reluctant Reader Appeal', 'Family Life']",,,,2,4.0,OTHER,OTHER,,,,,,,,.
I Am Jane Goodall,73201,CHANNEL 2,PAPERBACK,N,['Biography & Autobiography'],"['Scientists & Inventors', 'Women', 'Science & Nature', 'Reluctant Reader Appeal']",INSTRUCTION RESOURCE DIVISION,PRIMARY NONFICTION,580L,1,5.5,OTHER,OTHER,,,,,,,,.
Escape from Shudder Mansion,72020,CHANNEL 2,PAPERBACK,Y,['Horror & Supernatural'],"['Monsters & Ghosts', 'Reluctant Reader Appeal']",MYSTERY,,,1,6.99,OTHER,OTHER,,,,,,,,.
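
In the sample rows above, CH1_GENRE and CH1_THEME hold Python-style list strings, and ZIP_CODE appears as `.` on rows with no location data. As one possible cleaning step for the ETL stage, the sketch below parses those strings in plain Python. The helper names `parse_list_field` and `clean_zip` are hypothetical and not part of the dataset or of Spark, and treating `.` as a missing value is an assumption based only on the rows shown.

```python
import ast


def parse_list_field(raw):
    """Turn a string such as "['A', 'B']" into ['A', 'B']; blanks become []."""
    if raw is None or not raw.strip():
        return []
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a list literal (e.g. a plain category name): keep it as one item
        return [raw.strip()]
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def clean_zip(raw):
    """Return the zip code as a string, or None for placeholders such as '.'."""
    if raw is None:
        return None
    candidate = raw.strip()
    # Assumption: any non-numeric value (including '.') means "no zip code"
    return candidate if candidate.isdigit() else None
```

In the notebook, a function like `parse_list_field` could be wrapped with `pyspark.sql.functions.udf` and an `ArrayType(StringType())` return type to turn the genre and theme columns into arrays that can be exploded and counted.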


## Take Home Assignment

For take home, please answer the following questions:
 - Develop at least five queries to understand trends you see in the data. What does the data suggest about Scholastic’s distribution channels, and how would you recommend structuring a distribution strategy? 
 - What other publicly available data can you append (Census, state, region, etc.) to provide further insight? Please import at least one dataset in your analysis.
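
For the second item, one way to append an outside dataset is to load it as a second temp view and join it to `BooksTable` on zip code. The following is only a sketch under stated assumptions: the view name `ZipExtra`, the file passed in as `extra_csv_path`, and its columns `zip` and `population` are all hypothetical stand-ins for whatever public dataset is actually chosen.

```python
# Hypothetical join: BooksTable is the view created earlier in this notebook;
# ZipExtra, zip, and population are assumed names for the appended dataset.
ZIP_JOIN_SQL = """
SELECT b.*, z.population
FROM BooksTable b
LEFT JOIN ZipExtra z
  ON CAST(b.ZIP_CODE AS STRING) = z.zip
"""


def append_zip_data(spark, extra_csv_path):
    """Load an extra zip-level CSV and left-join it onto BooksTable."""
    # Read every column as a string so leading zeros in zip codes survive
    extra = spark.read.csv(extra_csv_path, header=True, inferSchema=False)
    extra.createOrReplaceTempView("ZipExtra")
    # A left join keeps sales rows that have no match in the extra data
    return spark.sql(ZIP_JOIN_SQL)
```

A left join is used here so that rows without a usable zip code are retained rather than silently dropped; whether that is the right choice depends on the question being asked.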

Please develop a proper Spark SQL query for each question, visualize the result, and write a few paragraphs discussing your findings. You need to label each question and submit the completed notebook with all visualizations (in HTML format) by the due date.
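
As a starting point for the queries, the sketch below shows what one Spark SQL query against the `BooksTable` view might look like: units sold and revenue by channel and product type. It uses only columns from the feature list above; the names `CHANNEL_SUMMARY_SQL` and `run_channel_summary` are hypothetical, and the revenue figure assumes that UNIT_PRICE multiplied by total_units is a reasonable estimate of sales value.

```python
# One possible "trend" query; the revenue calculation is an assumption
CHANNEL_SUMMARY_SQL = """
SELECT CHANNEL,
       PROD_TYP,
       SUM(total_units) AS units_sold,
       ROUND(SUM(total_units * UNIT_PRICE), 2) AS revenue
FROM BooksTable
GROUP BY CHANNEL, PROD_TYP
ORDER BY units_sold DESC
"""


def run_channel_summary(spark):
    """Run the summary query against the BooksTable temp view."""
    return spark.sql(CHANNEL_SUMMARY_SQL)
```

In a Databricks notebook, where `spark` is already defined, `display(run_channel_summary(spark))` would render the result and offer the built-in chart options for the visualization.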