# SQL Project
You were hired by Ironhack to carry out an analytics consulting project entitled "Competitive Landscape".

Your mission is to create and populate an appropriate database covering the many coding schools that compete with us, and to design suitable queries that answer business questions of interest (to be defined by you).


**Suggested Steps in the Project:**


*   Read this notebook and understand each function. Comment the code appropriately

*   Populate the list of schools with a wider variety of schools (how are you going to get the school ID?)

* Take a look at the resulting dataframes. What dimensions do you have? What keys do you have? How could the different dataframes be connected?

* Go back to the drawing board and try to create an entity relationship diagram for the available tables

* Once you have the schemas you want, you will need to:
  - create the suitable SQL queries to create the tables and populate them
  - run these queries using the appropriate Python connectors
  
* Bonus: How will this data model be updated in the future? Please write auxiliary functions that test the database for data quality issues. For example: how could you make sure you only include the most recent comments when you re-run the script?
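
The table-creation, population, and re-run steps above can be sketched with Python's built-in `sqlite3` connector. This is only a minimal sketch under stated assumptions: the table and column names (`schools`, `reviews`, `review_id`, `overall_score`) are illustrative, the review rows are made up, and a MySQL or PostgreSQL connector would follow a similar pattern with its own syntax. Declaring the review ID as a primary key and inserting with SQLite's `INSERT OR IGNORE` lets a re-run skip reviews that are already stored.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS schools (
    school_id INTEGER PRIMARY KEY,
    school    TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS reviews (
    review_id     INTEGER PRIMARY KEY,
    school_id     INTEGER NOT NULL REFERENCES schools(school_id),
    overall_score REAL,
    review_body   TEXT
);
"""


def load_reviews(conn, rows):
    """Insert review rows, skipping IDs already stored; return the number of new rows."""
    before = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO reviews "
        "(review_id, school_id, overall_score, review_body) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO schools VALUES (?, ?)", (10828, "ironhack"))

# Made-up sample rows, used only to demonstrate the re-run behaviour.
first_run = [(1, 10828, 4.5, "Great course"), (2, 10828, 3.0, "Okay")]
second_run = first_run + [(3, 10828, 5.0, "Loved it")]

new_first = load_reviews(conn, first_run)    # both rows are new
new_second = load_reviews(conn, second_run)  # only review 3 is new
print(new_first, new_second)
```

Counting rows before and after the insert doubles as a simple data-quality check: a re-run that reports zero new rows has added no duplicates.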


# Suggested Deliverables

* A 5-6 minute presentation of the data model created, the decision process, and the proposed business analysis

* An exported .sql file with the final schema

* Supporting Python files used to generate all logic

* High-level documentation explaining the tables designed, with a focus on update methods

Crucial hint: check out the following tutorial:
https://www.dataquest.io/blog/sql-insert-tutorial/


In [43]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
from pandas import json_normalize
import re
import numpy as np
import time
import multiprocessing
import concurrent.futures  # a bare `import concurrent` does not load the futures submodule

In [17]:
school_list = ['Springboard', 
             'Dataquest',
             'Syntax Technologies',
             'ironhack',
             'tripleten',
             'Colaberry',
             'Maven Analytics',
             'Udacity',
             'BrainStation',
             'CCS Learning Academy',
             'Thinkful',
             'General Assembly']

In [32]:
school_id_dict = {}

In [34]:
# Browser-like User-Agent; it is the same for every request, so define it once.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}

for school_name in school_list:
    # Build the URL slug: lowercase, spaces replaced by hyphens.
    slug = school_name.lower().replace(' ', '-')
    url = f"https://www.switchup.org/bootcamps/{slug}"
    response = requests.get(url, headers=headers, timeout=30)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # The numeric school ID is read from the school-id attribute of the <page-data> tag.
        page_data_tag = soup.find('page-data')

        if page_data_tag:
            school_id = page_data_tag.get('school-id')
            if school_id:
                school_id_dict[slug] = int(school_id)
                print(f"school ID for : {school_name}/{school_id}")
            else:
                print(f"School ID not found: {school_name}")
        else:
            print(f"Page data not found: {school_name}")

    else:
        # Include the school name so failed requests can be traced.
        print(f"Error {response.status_code}: {school_name}")

print("School ID dictionary:", school_id_dict)

school ID for : Springboard/11035
school ID for : Dataquest/10683
school ID for : Syntax Technologies/11797
school ID for : ironhack/10828
school ID for : tripleten/11225
school ID for : Colaberry/11718
school ID for : Maven Analytics/11740
school ID for : Udacity/11118
school ID for : BrainStation/10571
school ID for : CCS Learning Academy/11736
school ID for : Thinkful/11098
school ID for : General Assembly/10761
School ID dictionary: {'springboard': 11035, 'dataquest': 10683, 'syntax-technologies': 11797, 'ironhack': 10828, 'tripleten': 11225, 'colaberry': 11718, 'maven-analytics': 11740, 'udacity': 11118, 'brainstation': 10571, 'ccs-learning-academy': 11736, 'thinkful': 11098, 'general-assembly': 10761}
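
The slug rule used in the scraping loop (lowercase, spaces replaced by hyphens) can be factored into a small helper so that the URL and the dictionary key always agree. This is a sketch: the name `to_slug` is hypothetical, the extra `strip()` is an addition not present in the notebook, and school names containing other punctuation may need more handling.

```python
def to_slug(school_name: str) -> str:
    """Lowercase a school name and replace spaces with hyphens."""
    return school_name.strip().lower().replace(" ", "-")


# Names taken from school_list above.
slugs = [to_slug(name) for name in ["Maven Analytics", "CCS Learning Academy", "ironhack"]]
print(slugs)
```

The resulting slugs match the keys printed in the school ID dictionary above.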


In [35]:
school_id_dict

{'springboard': 11035,
 'dataquest': 10683,
 'syntax-technologies': 11797,
 'ironhack': 10828,
 'tripleten': 11225,
 'colaberry': 11718,
 'maven-analytics': 11740,
 'udacity': 11118,
 'brainstation': 10571,
 'ccs-learning-academy': 11736,
 'thinkful': 11098,
 'general-assembly': 10761}

In [36]:
schools = school_id_dict

In [37]:
def get_comments_school(school):
    """Fetch the reviews for one school slug and return them as a DataFrame."""
    TAG_RE = re.compile(r'<[^>]+>')  # matches any HTML tag
    # Review-list API endpoint; the school slug is inserted into the path,
    # so the same call works for any competitor.
    url = "https://www.switchup.org/chimera/v1/school-review-list?mainTemplate=school-review-list&path=%2Fbootcamps%2F" + school + "&isDataTarget=false&page=3&perPage=10000&simpleHtml=true&truncationLength=250"
    # Make the GET request and parse the JSON response.
    data = requests.get(url).json()
    # Convert the list of reviews to a DataFrame.
    reviews = pd.DataFrame(data['content']['reviews'])

    # Helper that strips HTML tags using the regex above.
    def remove_tags(x):
        return TAG_RE.sub('', x)
    reviews['review_body'] = reviews['body'].apply(remove_tags)
    reviews['school'] = school
    return reviews
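
The `remove_tags` helper strips markup, but an HTML entity such as `&amp;` would pass through unchanged. Below is a minimal sketch of a stricter cleaner that uses only the standard library. The name `clean_html` is hypothetical, and the sample string is made up in the shape of the `body` column.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def clean_html(text: str) -> str:
    """Remove HTML tags, decode entities, and collapse runs of whitespace."""
    without_tags = TAG_RE.sub("", text)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Made-up sample in the shape of a review body.
sample = '<span class="truncatable"><p></p><p>Structure &amp; networking</p></span>'
print(clean_html(sample))
```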

In [38]:
schools

{'springboard': 11035,
 'dataquest': 10683,
 'syntax-technologies': 11797,
 'ironhack': 10828,
 'tripleten': 11225,
 'colaberry': 11718,
 'maven-analytics': 11740,
 'udacity': 11118,
 'brainstation': 10571,
 'ccs-learning-academy': 11736,
 'thinkful': 11098,
 'general-assembly': 10761}

In [39]:
# could you write this as a list comprehension? ;)
comments = []

for school in schools.keys():
    print(school)
    comments.append(get_comments_school(school))
comments = pd.concat(comments)

springboard
dataquest
syntax-technologies
ironhack
tripleten
colaberry
maven-analytics
udacity
brainstation
ccs-learning-academy
thinkful
general-assembly
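
One possible answer to the question in the cell above: with DataFrames, the loop collapses to `pd.concat([get_comments_school(s) for s in schools])`. The sketch below shows the same pattern without network access. `fetch_reviews_stub` is a made-up stand-in for `get_comments_school` that returns plain dictionaries, and the final flattening step plays the role of `pd.concat`.

```python
def fetch_reviews_stub(school):
    """Stand-in for get_comments_school: returns two fake review rows per school."""
    return [{"id": f"{school}-{n}", "school": school} for n in (1, 2)]


schools = {"springboard": 11035, "dataquest": 10683}

# One list per school, analogous to the list of DataFrames built by the loop.
per_school = [fetch_reviews_stub(school) for school in schools]

# Flattening the nested lists plays the role of pd.concat here.
all_reviews = [row for rows in per_school for row in rows]
print(len(all_reviews))
```

Iterating over a dictionary yields its keys, so `for school in schools` is equivalent to `for school in schools.keys()`.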


In [40]:
comments

Unnamed: 0,id,name,anonymous,hostProgramName,graduatingYear,isAlumni,jobTitle,tagline,body,rawBody,...,queryDate,program,user,overallScore,comments,overall,curriculum,jobSupport,review_body,school
0,306549,Daniel Dluzynski,False,,2023.0,False,,Extensive and well built curriculum,"<span class=""truncatable""><p></p><p>This cours...",<p>This course is great for beginners. The cur...,...,2023-11-17,Cyber Security Career Track,{'image': None},4.3,[],4.0,5.0,4.0,This course is great for beginners. The curric...,springboard
1,306505,Jonathan Chiu,False,,2023.0,False,,Join if you're looking to structure &amp; Netw...,"<span class=""truncatable""><p></p><p>If you fin...",<p>If you find yourself unsure of where to beg...,...,2023-11-15,UI/UX Design Career Track,{'image': None},4.0,[],4.0,4.0,4.0,"If you find yourself unsure of where to begin,...",springboard
2,306504,Anonymous,True,,2023.0,False,,Join if you're looking to structure &amp; Netw...,"<span class=""truncatable""><p></p><p>If you fin...",<p>If you find yourself unsure of where to beg...,...,2023-11-15,UI/UX Design Career Track,{'image': None},4.0,[],4.0,4.0,4.0,"If you find yourself unsure of where to begin,...",springboard
3,306451,Anonymous,True,,2023.0,True,UX/UI Design,Wonderful,"<span class=""truncatable""><p></p><p>Pros: I fo...",<p>Pros: I found the Springboard bootcamp to b...,...,2023-11-14,UI/UX Design Career Track,{'image': None},4.3,[],4.0,5.0,4.0,Pros: I found the Springboard bootcamp to be i...,springboard
4,306317,Anonymous,True,,2023.0,False,Tech Sales,My experience at Springboard,"<span class=""truncatable""><p></p><p>My experie...",<p>My experience at Springboard was great. Won...,...,2023-11-08,Tech Sales Career Track,{'image': None},5.0,[],5.0,5.0,5.0,My experience at Springboard was great. Wonder...,springboard
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
994,231691,Abby Howell,False,,2013.0,True,Software Developer at Cengage Learning,From 2nd grade teacher to full-stack web devel...,"<span class=""truncatable""><p>My experience at ...",My experience at General Assembly's Web Develo...,...,2014-06-29,Software Engineering Immersive,{'image': None},5.0,[],5.0,5.0,5.0,My experience at General Assembly's Web Develo...,general-assembly
995,231827,Anonymous,False,,,False,,What you get out of the program really depends...,"<span class=""truncatable""><p></p><p>What you g...",<p>What you get out of the program really depe...,...,2014-06-15,,{'image': None},3.0,[],3.0,,,What you get out of the program really depends...,general-assembly
996,231816,Thomas Berry,False,,,False,,The bitmaker program provides opportunities an...,"<span class=""truncatable""><p></p><p>Personally...",<p>Personally I had a great experience at Bitm...,...,2014-06-15,,{'image': None},5.0,[],5.0,,,Personally I had a great experience at Bitmake...,general-assembly
997,231836,Ryan Racioppo,False,,,False,,Bitmaker is the best way to motivate and accel...,"<span class=""truncatable""><p></p><p>I was in t...",<p>I was in the 3rd cohort and have had a succ...,...,2014-06-15,,{'image': None},5.0,[],5.0,,,I was in the 3rd cohort and have had a success...,general-assembly


In [41]:
from pandas import json_normalize

def get_school_info(school, school_id):
    """Return the locations, courses, badges, and school-level DataFrames for one school."""
    url = 'https://www.switchup.org/chimera/v1/bootcamp-data?mainTemplate=bootcamp-data%2Fdescription&path=%2Fbootcamps%2F'+ str(school) + '&isDataTarget=false&bootcampId='+ str(school_id) + '&logoTag=logo&truncationLength=250&readMoreOmission=...&readMoreText=Read%20More&readLessText=Read%20Less'

    # Request the bootcamp description data and parse the JSON response.
    data = requests.get(url).json()

    # One row per course name.
    courses = data['content']['courses']
    courses_df = pd.DataFrame(courses, columns= ['courses'])

    locations = data['content']['locations']
    locations_df = json_normalize(locations)

    badges_df = pd.DataFrame(data['content']['meritBadges'])
    
    website = data['content']['webaddr']
    description = data['content']['description']
    logoUrl = data['content']['logoUrl']
    school_df = pd.DataFrame([website,description,logoUrl]).T
    school_df.columns =  ['website','description','LogoUrl']

    locations_df['school'] = school
    courses_df['school'] = school
    badges_df['school'] = school
    school_df['school'] = school
    

    locations_df['school_id'] = school_id
    courses_df['school_id'] = school_id
    badges_df['school_id'] = school_id
    school_df['school_id'] = school_id

    return locations_df, courses_df, badges_df, school_df

locations_list = []
courses_list = []
badges_list = []
schools_list = []

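for school, school_id in schools.items():  # avoid shadowing the built-in id()
    print(school)
    # Unpack the four DataFrames in the order returned by get_school_info.
    locations_df, courses_df, badges_df, school_df = get_school_info(school, school_id)

    locations_list.append(locations_df)
    courses_list.append(courses_df)
    badges_list.append(badges_df)
    schools_list.append(school_df)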



springboard
dataquest
syntax-technologies
ironhack
tripleten
colaberry
maven-analytics
udacity
brainstation
ccs-learning-academy
thinkful
general-assembly


In [50]:
locations_list

Unnamed: 0,id,description,state.id,state.name,state.abbrev,state.keyword,school,school_id
0,17154,Online,1,Online,Online,online,tripleten,11225


In [51]:
locations = pd.concat(locations_list)
locations

Unnamed: 0,id,description,state.id,state.name,state.abbrev,state.keyword,school,school_id,country.id,country.name,country.abbrev,city.id,city.name,city.keyword
0,16013,Online,1.0,Online,Online,online,springboard,11035,,,,,,
0,16378,Online,1.0,Online,Online,online,dataquest,10683,,,,,,
0,18261,Online,1.0,Online,Online,online,syntax-technologies,11797,,,,,,
0,15901,"Berlin, Germany",,,,,ironhack,10828,57.0,Germany,DE,31156.0,Berlin,berlin
1,16022,"Mexico City, Mexico",,,,,ironhack,10828,29.0,Mexico,MX,31175.0,Mexico City,mexico-city
2,16086,"Amsterdam, Netherlands",,,,,ironhack,10828,59.0,Netherlands,NL,31168.0,Amsterdam,amsterdam
3,16088,"Sao Paulo, Brazil",,,,,ironhack,10828,42.0,Brazil,BR,31121.0,Sao Paulo,sao-paulo
4,16109,"Paris, France",,,,,ironhack,10828,38.0,France,FR,31136.0,Paris,paris
5,16375,"Miami, FL, United States",11.0,Florida,FL,florida,ironhack,10828,1.0,United States,US,31.0,Miami,miami
6,16376,"Madrid, Spain",,,,,ironhack,10828,12.0,Spain,ES,31052.0,Madrid,madrid
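
The concatenated locations mix two shapes: online rows carry only `state.*` fields, most campus rows carry only `country.*` and `city.*` fields, and the Miami row carries both. A data-quality check of the kind suggested in the bonus step might verify that location IDs are unique and that every row has at least a state or a country. This is a sketch: `check_locations` is a hypothetical helper, and the rows are copied from the output above as plain dictionaries with `None` standing in for `NaN`.

```python
def check_locations(rows):
    """Return a list of human-readable problems found in location rows."""
    problems = []
    seen_ids = set()
    for row in rows:
        if row["id"] in seen_ids:
            problems.append(f"duplicate location id {row['id']}")
        seen_ids.add(row["id"])
        if row.get("state.name") is None and row.get("country.name") is None:
            problems.append(f"location {row['id']} has neither state nor country")
    return problems


rows = [
    {"id": 16013, "school": "springboard", "state.name": "Online", "country.name": None},
    {"id": 15901, "school": "ironhack", "state.name": None, "country.name": "Germany"},
    {"id": 16375, "school": "ironhack", "state.name": "Florida", "country.name": "United States"},
]

# A deliberately broken, made-up row: repeated id and no geography at all.
bad_row = {"id": 16013, "school": "springboard", "state.name": None, "country.name": None}

clean = check_locations(rows)
broken = check_locations(rows + [bad_row])
print(clean)
print(broken)
```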


In [52]:
courses = pd.concat(courses_list)
courses.head(10)

Unnamed: 0,courses,school,school_id
0,Cyber Security Career Track,springboard,11035
1,Data Analytics Career Track,springboard,11035
2,Data Science Career Track,springboard,11035
3,Data Science Career Track Prep,springboard,11035
4,Front-End Web Development,springboard,11035
5,Introduction to Data Analytics,springboard,11035
6,Introduction to Design,springboard,11035
7,Software Engineering Career Track,springboard,11035
8,Software Engineering Career Track Prep Course,springboard,11035
9,Software Engineering Foundations to Core,springboard,11035


In [53]:
badges = pd.concat(badges_list)
badges.head()

Unnamed: 0,name,keyword,description,school,school_id
0,Available Online,available_online,<p>School offers fully online courses</p>,springboard,11035
1,Flexible Classes,flexible_classes,<p>School offers part-time and evening classes...,springboard,11035
2,Job Guarantee,job_guarantee,<p>School guarantees job placement</p>,springboard,11035
0,Available Online,available_online,<p>School offers fully online courses</p>,dataquest,10683
1,Flexible Classes,flexible_classes,<p>School offers part-time and evening classes...,dataquest,10683
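
Because every frame carries `school_id`, it can serve as the key that connects the tables. The sketch below loads only the five badge rows visible in `badges.head()` into an in-memory SQLite database and counts badges per school with a join. The table layout is an assumption, and the counts describe this sample only, not the full data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical two-table layout linked by school_id.
conn.executescript("""
CREATE TABLE schools (school_id INTEGER PRIMARY KEY, school TEXT);
CREATE TABLE badges (
    school_id INTEGER REFERENCES schools(school_id),
    keyword   TEXT,
    PRIMARY KEY (school_id, keyword)
);
""")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [(11035, "springboard"), (10683, "dataquest")],
)
# Only the five rows shown by badges.head(); the full frame likely has more.
conn.executemany(
    "INSERT INTO badges VALUES (?, ?)",
    [
        (11035, "available_online"),
        (11035, "flexible_classes"),
        (11035, "job_guarantee"),
        (10683, "available_online"),
        (10683, "flexible_classes"),
    ],
)

badge_counts = conn.execute("""
    SELECT s.school, COUNT(b.keyword) AS n_badges
    FROM schools AS s
    LEFT JOIN badges AS b ON b.school_id = s.school_id
    GROUP BY s.school
    ORDER BY s.school
""").fetchall()
print(badge_counts)
```

The `LEFT JOIN` keeps schools that have no badges at all, which would otherwise disappear from the result.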


In [57]:
# any data cleaning still missing here? take a look at the description
schools = pd.concat(schools_list)
schools

Unnamed: 0,website,description,LogoUrl,school,school_id
0,www.springboard.com/?utm_source=switchup&utm_m...,"<span class=""truncatable""><p>Springboard is an...",https://d92mrp7hetgfk.cloudfront.net/images/si...,springboard,11035
0,www.dataquest.io,"<span class=""truncatable""><p>Master data skill...",https://d92mrp7hetgfk.cloudfront.net/images/si...,dataquest,10683
0,www.syntaxtechs.com/,"<span class=""truncatable""><p>Syntax Technologi...",https://d92mrp7hetgfk.cloudfront.net/images/si...,syntax-technologies,11797
0,www.ironhack.com/en,"<span class=""truncatable""><p>Ironhack is a glo...",https://d92mrp7hetgfk.cloudfront.net/images/si...,ironhack,10828
0,tripleten.com/?utm_source=referral&utm_medium=...,"<span class=""truncatable""><p>Tripleten changed...",https://d92mrp7hetgfk.cloudfront.net/images/si...,tripleten,11225
0,www.colaberry.com/,"<span class=""truncatable""><p>Colaberry offers ...",https://d92mrp7hetgfk.cloudfront.net/images/si...,colaberry,11718
0,www.mavenanalytics.io/,"<span class=""truncatable""><p>Maven Analytics i...",https://d92mrp7hetgfk.cloudfront.net/images/si...,maven-analytics,11740
0,www.udacity.com/?utm_source=switchup&utm_mediu...,"<span class=""truncatable""><p>Udacity is the tr...",https://d92mrp7hetgfk.cloudfront.net/images/si...,udacity,11118
0,brainstation.io,"<span class=""truncatable""><p>BrainStation is t...",https://d92mrp7hetgfk.cloudfront.net/images/si...,brainstation,10571
0,ccslearningacademy.com/,"<span class=""truncatable""><p>TECH TRAINING BY ...",https://d92mrp7hetgfk.cloudfront.net/images/si...,ccs-learning-academy,11736
