# Choose a Data Set

Create your own dataset by scraping one of the following websites *(level 5)*:
- [Wikipedia](https://www.wikipedia.org/)
- [OpenLibrary](https://openlibrary.org/)

**OR** 

Use data gathered from one of the following APIs *(level 4)*: 
- [TMDB](https://developer.themoviedb.org/reference/intro/getting-started)
- [College Scorecard](https://collegescorecard.ed.gov/data/api-documentation/)

**OR** 

Pick a JSON dataset *(level 3)*:
- [Food/Restaurant Data](https://drive.google.com/drive/folders/1V94S6WpclvQmbnW88KVMD4EruryA1oma?usp=drive_link)
- [Fashion Data](https://drive.google.com/drive/folders/1V8SbFjtRRW8WVf3xBzg0gzLjOtMhHea_?usp=drive_link)

**OR** 

Pick a CSV dataset *(level 2)*:
- [LA Parking Tickets](https://drive.google.com/drive/folders/1vaOfwMi6QmZEGsXr8VM0ulPGzvTTBCgm?usp=drive_link)
- [Hotels](https://drive.google.com/drive/folders/1IpVFxgwBJvJHKoOuBsk6WK2qYqFYP4hi?usp=drive_link)

# My Question

What is the average rating for a book in each country and what is the most high-rated year-range for each country (United State China and Russa)

# My Answer

In [1]:
import bs4 as bs
import random
import pandas as pd
import seaborn as sb
import numpy as np
import datetime
import requests as rq
import pickle
import json
import time
from collections import Counter
import matplotlib.pyplot as plt
# PVLAES4RV3LWZCBE

def save_pickle(file, value):
    pickle.dump(value, open(file, "wb"))
def load_pickle(file):
    return pickle.load(open(file, "rb"))
"""
pages = []
for x in range(1,11):
    URL = "https://openlibrary.org/search?q=United States&mode=everything&sort=rating&page=" + str(x)
    page = rq.get(URL)
    pages.append(bs.BeautifulSoup(page.content, "html.parser"))
save_pickle("UnitedStates.p", pages)

pages = []
for x in range(1,11):
    URL = "https://openlibrary.org/search?q=China&mode=everything&sort=rating&page=" + str(x)
    page = rq.get(URL)
    pages.append(bs.BeautifulSoup(page.content, "html.parser"))
save_pickle("China.p", pages)

pages = []
for x in range(1,11):
    URL = "https://openlibrary.org/search?q=Russia&mode=everything&sort=rating&page=" + str(x)
    page = rq.get(URL)
    pages.append(bs.BeautifulSoup(page.content, "html.parser"))
save_pickle("Russia.p", pages)
"""

countrys = {"us": load_pickle("UnitedStates.p"), "ch": load_pickle("China.p"), "ru": load_pickle("Russia.p")}
countrysCombined = {"us": [], "ch": [], "ru": []}

for key in ['us', 'ch', 'ru']:
    for y in range(1,10):
        #countrysCombined[key].append(countrys[key][y].find_all("li", {"itemtype": "https://schema.org/Book"}))
        for z in countrys[key][y].find_all("li", {"itemtype": "https://schema.org/Book"}):
            countrysCombined[key].append(z)

books = {"us": {"title": [], "author": [], "rating": [], "year": []}, "ch": {"title": [], "author": [], "rating": [], "year": []}, "ru": {"title": [], "author": [], "rating": [], "year": []}}

for country in ["us", "ch", "ru"]:
    for book in countrysCombined[country]:
        title = book.find("a", {"itemprop": "url"}).text

        try:
            author = book.find("span", {"itemprop": "author"}).find("a").text
        except:
            author = "unknown"
        
        try:
            year = book.find("span", {"class": "resultDetails"}).find("span").text.replace("First published in ", "")
        except:
            year = "unknown"

        try:
            rating = book.find("span", {"itemprop": "ratingValue"}).text[:3] # :3
        except:
            rating = "unknown"

        if author != "unknown" and year != "unknown" and rating != "unknown":
            fail = False
            try:
                int(year)
                float(rating)
            except:
                fail = True
            if fail != True:
                books[country]['title'].append(title)
                books[country]['author'].append(author)
                books[country]['year'].append(int(year))
                books[country]['rating'].append(float(rating))
        
save_pickle("books.p", books)

In [95]:
books = load_pickle("books.p")

for country in ["us", "ch", "ru"]:
    print(sum(books[country]['rating'])/len(books[country]['rating']))

4.323863636363637
4.605263157894737
3.9766355140186915


In [142]:
books = load_pickle("books.p")

yearsCountry = {"us": [], "ch": [], "ru": []}
for country in ["us", "ch", "ru"]:
    years = {}
    for x in range(int(str(min(books[country]['year']))[:2]),21):
        for book in range(len(books[country]['year'])):
            if int(str(books[country]['year'][book])[:2]) == x:
                try:
                    years[x]
                except:
                    years[x] = []

                years[x].append(books['us']['rating'][book])

    yearsCountry[country] = years

                #print(int(str(books['us']['year'][book])[:2]))

#print(yearsCountry)

for country in yearsCountry:
    for year in yearsCountry[country]:
        
        print("average rating for the century " + str(year) + "00 in the country " + country + " is " + str(sum(yearsCountry[country][year])/len(yearsCountry[country][year])))
    print()


average rating for the century 1400 in the country us is 3.7
average rating for the century 1500 in the country us is 3.95
average rating for the century 1600 in the country us is 3.55
average rating for the century 1700 in the country us is 4.2727272727272725
average rating for the century 1800 in the country us is 4.058620689655172
average rating for the century 1900 in the country us is 4.363636363636363
average rating for the century 2000 in the country us is 4.494230769230769

average rating for the century 1500 in the country ch is 5.0
average rating for the century 1600 in the country ch is 4.199999999999999
average rating for the century 1700 in the country ch is 3.85
average rating for the century 1800 in the country ch is 4.315384615384615
average rating for the century 1900 in the country ch is 4.408
average rating for the century 2000 in the country ch is 4.217105263157895

average rating for the century 1700 in the country ru is 4.0
average rating for the century 1800 in t

***Describe analysis here.***

The answer to my question is the average rating per country in the US China and Russia is 
4.323863636363637
4.605263157894737
3.9766355140186915
stars respectivally

The answer to the question of what is the average rating per year range in each country is 
average rating for the century 1400 in the country us is 3.7
average rating for the century 1500 in the country us is 3.95
average rating for the century 1600 in the country us is 3.55
average rating for the century 1700 in the country us is 4.2727272727272725
average rating for the century 1800 in the country us is 4.058620689655172
average rating for the century 1900 in the country us is 4.363636363636363
average rating for the century 2000 in the country us is 4.494230769230769

average rating for the century 1500 in the country ch is 5.0
average rating for the century 1600 in the country ch is 4.199999999999999
average rating for the century 1700 in the country ch is 3.85
average rating for the century 1800 in the country ch is 4.315384615384615
average rating for the century 1900 in the country ch is 4.408
average rating for the century 2000 in the country ch is 4.217105263157895

average rating for the century 1700 in the country ru is 4.0
average rating for the century 1800 in the country ru is 3.6
average rating for the century 1900 in the country ru is 4.307317073170732
average rating for the century 2000 in the country ru is 4.138709677419355



***Describe analysis here.***

In [3]:
# Add more code/markdown cells here if you need them.