# SQL Project
You have been hired by Ironhack to carry out an analytics consulting project entitled "Competitive Landscape".

Your mission is to create and populate an appropriate database covering the many coding schools that compete with us, and to design suitable queries that answer business questions of interest (to be defined by you).


**Suggested Steps in the Project:**


*   Read this notebook and understand each function. Comment the code appropriately

*   Populate the list of schools with a wider variety of schools (how are you going to get the school ID?)

* Take a look at the resulting dataframes. What dimensions do you have? What keys do you have? How could the different dataframes be connected?

* Go back to the drawing board and try to create an entity relationship diagram for the tables available

* Once you have the schemas you want, you will need to:
  - create the suitable SQL queries to create the tables and populate them
  - run these queries using the appropriate Python connectors
  
* Bonus: How will this data model be updated in the future? Please write auxiliary functions that test the database for data quality issues. For example: how could you make sure you only include the most recent comments when you re-run the script?
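
For the bonus question, one possible approach is to let the database itself reject reviews it has already stored, so that re-running the script only adds new ones. The sketch below uses the standard-library `sqlite3` module with an in-memory database. The `reviews` table, its columns, the helper names, and the assumption that scores lie between 0 and 5 are all illustrative and not part of the notebook.

```python
import sqlite3


def create_reviews_table(conn):
    # Hypothetical minimal schema: the review id is the primary key,
    # so the same review can never be stored twice.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reviews (
            id         INTEGER PRIMARY KEY,
            school     TEXT NOT NULL,
            created_at TEXT NOT NULL,
            overall    REAL
        )
        """
    )


def insert_new_reviews(conn, rows):
    """Insert (id, school, created_at, overall) tuples, skipping known ids.

    Returns the number of rows that were actually added.
    """
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO reviews (id, school, created_at, overall) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


def count_invalid_scores(conn):
    """Data-quality check: reviews with a missing or out-of-range score.

    Assumes scores are expected to lie between 0 and 5.
    """
    query = (
        "SELECT COUNT(*) FROM reviews "
        "WHERE overall IS NULL OR overall NOT BETWEEN 0 AND 5"
    )
    return conn.execute(query).fetchone()[0]


# Made-up sample rows, only to exercise the helpers.
conn = sqlite3.connect(":memory:")
create_reviews_table(conn)
first_run = [(1, "le-wagon", "2023-10-02", 5.0), (2, "le-wagon", "2023-10-05", 5.0)]
second_run = first_run + [(3, "le-wagon", "2023-11-01", 4.0)]

added_first = insert_new_reviews(conn, first_run)
added_second = insert_new_reviews(conn, second_run)
invalid = count_invalid_scores(conn)
```

With this design the second run adds only the one review whose id was not yet stored. Other connectors and databases offer similar "insert or ignore" or upsert statements, but their syntax differs.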


# Suggested Deliverables

* 7-minute presentation of the data model created, the decision process, and the business analysis proposed

* Exported .sql file with the final schema

* Supporting Python files used to generate all logic

* High-level documentation explaining the tables designed and focusing on update methods

Crucial hint: check out the following tutorial:
https://www.dataquest.io/blog/sql-insert-tutorial/
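
As a rough illustration of what inserting dataframe rows through a Python connector can look like, here is a minimal sketch using the standard-library `sqlite3` module. The `schools` table definition and the `schools_sample` frame are assumptions made for this example; a project database (MySQL, PostgreSQL, ...) would use its own connector and placeholder style.

```python
import sqlite3

import pandas as pd

# Small sample frame with a subset of the columns of the notebook's schools table.
schools_sample = pd.DataFrame(
    {
        "school_id": ["10828", "10868"],
        "school": ["ironhack", "le-wagon"],
        "website": ["www.ironhack.com/en", "www.lewagon.com"],
    }
)

conn = sqlite3.connect(":memory:")
# Hypothetical table definition for the example.
conn.execute(
    "CREATE TABLE schools ("
    "school_id TEXT PRIMARY KEY, school TEXT NOT NULL, website TEXT)"
)

# itertuples(index=False, name=None) yields one plain tuple per row.
rows = list(schools_sample.itertuples(index=False, name=None))

# The ? placeholders keep the values out of the SQL string itself.
conn.executemany(
    "INSERT INTO schools (school_id, school, website) VALUES (?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute("SELECT school FROM schools ORDER BY school").fetchall()
```

Passing the values separately from the statement, rather than formatting them into the SQL text, avoids quoting problems with free-text fields such as review bodies.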


In [10]:
schools_main = {'le-wagon': '10868',
 'springboard': '11035',
 'udacity': '11118',
 'shecodes': '11014',
 'ironhack': '10828',
 'app-academy': '10525',
 'designlab': '10697',
 'nucamp': '10923',
 'thinkful': '11098',
 'software-development-academy': '11030',
 'coding-dojo': '10659',
 'makers-academy': '10874',
 'product-gym': '10959',
 'actualize': '10505',
 'edureka': '11739',
 'careerist': '11280',
 'simplilearn': '11016',
 'nyc-data-science-academy': '10925',
 'the-tech-academy': '11091',
 'la-capsule': '10853',
 'acte': '11735',
 'brainstation': '10571',
 'hack-reactor': '10788',
 'codesmith': '10643',
 'dataquest': '10683',
 'jedha': '10837',
 'greyatom-school-of-data-science': '10776',
 'academia-de-codigo': '10494',
 'microverse': '10888',
 'juno-college-of-technology': '10787',
 'datascientest': '11232',
 'clarusway': '11539',
 'product-school': '10960',
 'knowledgehut': '10846',
 'hyperiondev': '10801',
 'tech-elevator': '11056',
 'evolve-security-academy': '10743',
 'school-of-it': '11006',
 'code-institute': '10619',
 'vertical-institute': '11241',
 'isdi-coders': '11024',
 'tripleten': '11225',
 'altcademy': '10517',
 'happyer-skills': '11518',
 'bloomtech': '10854',
 'devcodecamp': '10703',
 'xccelerate': '11175',
 'developers-institute': '10705',
 'neoland': '10906',
 'careerfoundry': '10581',
 '4geeks-academy': '10492',
 'learningfuze': '10862',
 'galvanize': '10754',
 'coding-temple': '10664',
 'hacktiv8': '10791',
 'codeworks': '10650',
 'wbs-coding-school': '11243',
 'ubiqum-code-academy': '11111',
 'digitalcrafts': '10719',
 'ux-design-institute': '11150',
 'code-fellows': '10614',
 'yellow-tail-tech': '11545',
 'fullstack-academy': '10751',
 'claim-academy': '10589',
 'rocket-academy': '11483',
 'skillcrush': '11020',
 'rmotr': '10987'}

import re
import pandas as pd
# json_normalize lives at the top level of pandas; importing it from pandas.io.json is deprecated
from pandas import json_normalize
import requests



def get_comments_school(school):
    """Download the reviews of one school and return them as a dataframe."""
    # regex matching any HTML tag, e.g. <p> or </span>
    TAG_RE = re.compile(r'<[^>]+>')
    # review-list endpoint; the school slug goes into the path parameter,
    # so the same call works for any competitor
    url = "https://www.switchup.org/chimera/v1/school-review-list?mainTemplate=school-review-list&path=%2Fbootcamps%2F" + school + "&isDataTarget=false&page=3&perPage=10000&simpleHtml=true&truncationLength=250"
    # make the GET request and parse the JSON answer
    data = requests.get(url).json()
    # convert the list of reviews to a dataframe
    reviews = pd.DataFrame(data['content']['reviews'])

    # aux function to apply the regex and remove tags
    def remove_tags(x):
        return TAG_RE.sub('', x)

    # plain-text version of the review body, plus the school slug as a key
    reviews['review_body'] = reviews['body'].apply(remove_tags)
    reviews['school'] = school
    return reviews
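
The long URL above packs several query parameters into one string. As a sketch only, the same URL can be assembled from named parameters with `urllib.parse.urlencode`, which makes `page` and `perPage` easy to vary. The helper name `build_reviews_url` and its defaults are assumptions for illustration; the parameter names and values are copied from the URL used in `get_comments_school`.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.switchup.org/chimera/v1/school-review-list"


def build_reviews_url(school, page=3, per_page=10000):
    """Build the review-list URL for one school slug (hypothetical helper)."""
    params = {
        "mainTemplate": "school-review-list",
        "path": "/bootcamps/" + school,  # urlencode escapes "/" as %2F
        "isDataTarget": "false",
        "page": page,
        "perPage": per_page,
        "simpleHtml": "true",
        "truncationLength": 250,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_reviews_url("ironhack")
```

With the default arguments this produces the same string as the concatenation inside `get_comments_school`.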

In [39]:
# module-level copy of the tag-stripping helper, reused below for the badge descriptions
TAG_RE = re.compile(r'<[^>]+>')
def remove_tags(x):
    return TAG_RE.sub('',x)

In [11]:
comments = [get_comments_school(i) for i in schools_main.keys()]

comments = pd.concat(comments)

In [12]:
comments

Unnamed: 0,id,name,anonymous,hostProgramName,graduatingYear,isAlumni,jobTitle,tagline,body,rawBody,...,queryDate,program,user,overallScore,comments,overall,curriculum,jobSupport,review_body,school
0,305425,Minnerva Sasu,False,,2023.0,False,,Amazing!!!,"<span class=""truncatable""><p></p><p>In 9 weeks...","<p>In 9 weeks, I went from procrastinating to ...",...,2023-10-05,Data Science & AI - Full-Time,{'image': None},5.0,[],5.0,5.0,5.0,"In 9 weeks, I went from procrastinating to doi...",le-wagon
1,305313,Alexander Pegot,False,,2022.0,False,Data Analyst,Great Experience,"<span class=""truncatable""><p></p><p>After maki...",<p>After making the decision to pursue a caree...,...,2023-10-02,Data Science & AI - Full-Time,{'image': None},5.0,[],5.0,5.0,5.0,After making the decision to pursue a career i...,le-wagon
2,305209,Anonymous,True,,2023.0,False,,Incredible bootcamp!,"<span class=""truncatable""><p></p><p>I feel ver...",<p>I feel very confident in my programming ski...,...,2023-09-27,Web Development - Full-Time,{'image': None},5.0,[],5.0,5.0,5.0,I feel very confident in my programming skills...,le-wagon
3,305205,Anonymous,True,,2023.0,False,,That was amazing experience at Le wagon,"<span class=""truncatable""><p></p><p>I didn't r...",<p>I didn&#39;t really have coding knowledge i...,...,2023-09-27,Web Development - Full-Time,{'image': None},5.0,[],5.0,5.0,5.0,I didn't really have coding knowledge initiall...,le-wagon
4,305099,Silvia Santillan Nava,False,,2023.0,False,Senior BDR,Data Analytics proficient in 9 weeks,"<span class=""truncatable""><p></p><p>The online...",<p>The online data analytics bootcamp offered ...,...,2023-09-22,Data Analytics - Full-Time,{'image': None},5.0,[],5.0,5.0,5.0,The online data analytics bootcamp offered a t...,le-wagon
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137,232119,Akshith Yellapragada,False,,2015.0,True,,Highly recommended! But not for beginners.,"<span class=""truncatable""><p>Getting through t...",Getting through the beginning of learning to c...,...,2017-07-31,Data Science,{'image': None},5.0,[],5.0,5.0,5.0,Getting through the beginning of learning to c...,rmotr
138,237029,Yatri Trivedi,False,,2016.0,True,,A stellar course for budding programmers,"<span class=""truncatable""><p>I found this cour...","I found this course on reddit and applied, thi...",...,2017-07-31,Data Science,{'image': None},5.0,[],5.0,5.0,5.0,"I found this course on reddit and applied, thi...",rmotr
139,232047,Jonathan Hartford,False,,2015.0,True,,Advanced Python Programming,"<span class=""truncatable""><p>This is an excell...",This is an excellent course. Working with oth...,...,2017-07-30,Data Science,{'image': None},5.0,[],5.0,5.0,5.0,This is an excellent course. Working with othe...,rmotr
140,235529,Vojtech Kotek,False,,2017.0,True,https://github.com/vkotek,Awesome course!,<p>This course gave me a lot of knew knowledge...,This course gave me a lot of knew knowledge ab...,...,2017-03-29,Data Science,{'image': None},5.0,[],5.0,5.0,5.0,This course gave me a lot of knew knowledge ab...,rmotr


In [17]:
comments["school"].value_counts()

le-wagon             2730
springboard          1552
udacity              1405
shecodes             1397
ironhack             1283
                     ... 
fullstack-academy     153
claim-academy         152
rocket-academy        149
skillcrush            145
rmotr                 142
Name: school, Length: 67, dtype: int64

In [15]:
comments.columns

Index(['id', 'name', 'anonymous', 'hostProgramName', 'graduatingYear',
       'isAlumni', 'jobTitle', 'tagline', 'body', 'rawBody', 'createdAt',
       'queryDate', 'program', 'user', 'overallScore', 'comments', 'overall',
       'curriculum', 'jobSupport', 'review_body', 'school'],
      dtype='object')
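
In the preview above, `user` holds a dict and `comments` holds a list, and `body` / `rawBody` still contain HTML. Values like these do not map onto a single SQL column, so one option is to drop them and keep one row per review `id` before loading. The following is a minimal sketch under that assumption; `prepare_reviews_for_sql` is a hypothetical helper and the sample frame is made up.

```python
import pandas as pd


def prepare_reviews_for_sql(comments):
    """Return a flat, de-duplicated copy of the reviews frame (hypothetical helper)."""
    # Columns holding dicts, lists, or raw HTML are dropped if present.
    nested_or_html = ["user", "comments", "body", "rawBody"]
    flat = comments.drop(columns=nested_or_html, errors="ignore")
    # Keep one row per review id and give the frame a clean 0..n-1 index.
    flat = flat.drop_duplicates(subset="id").reset_index(drop=True)
    return flat


# Made-up sample with a repeated id, mimicking a concatenated frame.
sample = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "user": [{"image": None}] * 3,
        "comments": [[], [], []],
        "review_body": ["great", "great", "good"],
        "school": ["ironhack", "ironhack", "le-wagon"],
    }
)
flat_reviews = prepare_reviews_for_sql(sample)
```

Resetting the index also removes the repeated 0, 1, 2, ... labels that `pd.concat` leaves behind when the per-school frames are stacked.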

In [18]:
from pandas import json_normalize

def get_school_info(school, school_id):
    url = 'https://www.switchup.org/chimera/v1/bootcamp-data?mainTemplate=bootcamp-data%2Fdescription&path=%2Fbootcamps%2F'+ str(school) + '&isDataTarget=false&bootcampId='+ str(school_id) + '&logoTag=logo&truncationLength=250&readMoreOmission=...&readMoreText=Read%20More&readLessText=Read%20Less'

    data = requests.get(url).json()

    courses = data['content']['courses']
    courses_df = pd.DataFrame(courses, columns= ['courses'])

    locations = data['content']['locations']
    locations_df = json_normalize(locations)

    badges_df = pd.DataFrame(data['content']['meritBadges'])

    website = data['content']['webaddr']
    description = data['content']['description']
    logoUrl = data['content']['logoUrl']
    school_df = pd.DataFrame([website,description,logoUrl]).T
    school_df.columns =  ['website','description','LogoUrl']

    locations_df['school'] = school
    courses_df['school'] = school
    badges_df['school'] = school
    school_df['school'] = school


    locations_df['school_id'] = school_id
    courses_df['school_id'] = school_id
    badges_df['school_id'] = school_id
    school_df['school_id'] = school_id

    return locations_df, courses_df, badges_df, school_df

locations_list = []
courses_list = []
badges_list = []
schools_list = []

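for school, school_id in schools_main.items():
    # one dataframe per entity for each school
    locations_df, courses_df, badges_df, school_df = get_school_info(school, school_id)

    locations_list.append(locations_df)
    courses_list.append(courses_df)
    badges_list.append(badges_df)
    schools_list.append(school_df)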




In [19]:
locations = pd.concat(locations_list)
locations

Unnamed: 0,id,description,country.id,country.name,country.abbrev,city.id,city.name,city.keyword,state.id,state.name,state.abbrev,state.keyword,school,school_id
0,15803,"Melbourne, Australia",20.0,Australia,AU,31174.0,Melbourne,melbourne,,,,,le-wagon,10868
1,15904,"Casablanca, Morocco",44.0,Morocco,MA,31119.0,Casablanca,casablanca,,,,,le-wagon,10868
2,15906,"Buenos Aires, Argentina",60.0,Argentina,AR,31171.0,Buenos Aires,buenos-aires,,,,,le-wagon,10868
3,15964,"Brussels, Belgium",46.0,Belgium,BE,31125.0,Brussels,brussels,,,,,le-wagon,10868
4,16039,"Mexico City, Mexico",29.0,Mexico,MX,31175.0,Mexico City,mexico-city,,,,,le-wagon,10868
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,17703,Singapore,56.0,Singapore,SG,31154.0,Singapore,singapore,,,,,rocket-academy,11483
2,18140,"Kowloon Bay, Hong Kong",144.0,Hong Kong,HK,31280.0,Kowloon Bay,kowloon-bay,,,,,rocket-academy,11483
3,18141,"Sydney, Australia",20.0,Australia,AU,31110.0,Sydney,sydney,,,,,rocket-academy,11483
0,15796,Online,,,,,,,1.0,Online,Online,online,skillcrush,11020
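
The locations output repeats the same country fields (`country.id`, `country.name`, `country.abbrev`) on every row, which suggests that countries could become their own table in the entity relationship diagram. A minimal sketch of extracting such a dimension table follows; the function name and the target column names are assumptions, and the sample rows reuse values visible in the output above.

```python
import pandas as pd


def build_countries_table(locations):
    """Extract one row per country from the locations frame (hypothetical helper)."""
    cols = {
        "country.id": "country_id",
        "country.name": "name",
        "country.abbrev": "abbrev",
    }
    countries = (
        locations[list(cols)]
        .rename(columns=cols)
        .dropna(subset=["country_id"])         # online-only rows have no country
        .drop_duplicates(subset="country_id")  # one row per country
        .astype({"country_id": int})
        .sort_values("country_id")
        .reset_index(drop=True)
    )
    return countries


sample_locations = pd.DataFrame(
    {
        "id": [15803, 15904, 18141, 15796],
        "country.id": [20.0, 44.0, 20.0, None],
        "country.name": ["Australia", "Morocco", "Australia", None],
        "country.abbrev": ["AU", "MA", "AU", None],
        "school": ["le-wagon", "le-wagon", "rocket-academy", "skillcrush"],
    }
)
countries = build_countries_table(sample_locations)
```

The locations table would then only need to keep `country_id` as a foreign key. The same pattern could be applied to cities and states.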


In [45]:
courses = pd.concat(courses_list)
courses

Unnamed: 0,courses,school,school_id
0,Data Analytics - Full-Time,le-wagon,10868
1,Data Analytics - Part-Time,le-wagon,10868
2,Data Engineering - Full-Time,le-wagon,10868
3,Data Engineering - Part-Time,le-wagon,10868
4,Data Science & AI - Full-Time,le-wagon,10868
...,...,...,...
0,Break Into Tech + Job Guarantee: Designer Track,skillcrush,11020
1,Break Into Tech + Job Guarantee: Front End Dev...,skillcrush,11020
2,Skillcrush Coding Camp,skillcrush,11020
0,Data Science with Python,rmotr,10987


In [42]:
badges = pd.concat(badges_list)
badges["description"] = badges["description"].apply(remove_tags)
badges["description"]

0             School offers fully online courses
1    School offers part-time and evening classes
0             School offers fully online courses
1    School offers part-time and evening classes
2                School guarantees job placement
                        ...                     
0             School offers fully online courses
1    School offers part-time and evening classes
2                School guarantees job placement
0             School offers fully online courses
1    School offers part-time and evening classes
Name: description, Length: 161, dtype: object

In [29]:
schools = pd.concat(schools_list)

from bs4 import BeautifulSoup

def schools_cleaning(row):
    # parse the HTML description and keep only the text of <p> and <span> tags
    soup = BeautifulSoup(row, 'html.parser')
    text = [tag.get_text() for tag in soup.find_all(['p', 'span'])]
    return ' '.join(text)

schools["description"] = schools["description"].apply(schools_cleaning)

In [34]:
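# pd.merge needs dataframes, so the schools_main dict is first turned into a two-column lookup
schools_lookup = pd.DataFrame(list(schools_main.items()), columns=['school', 'school_id'])
schools = pd.merge(schools_lookup, schools, how='outer', on=['school', 'school_id'])
schools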

Unnamed: 0,school,school_id,website,description,LogoUrl
0,le-wagon,10868,www.lewagon.com,Le Wagon is a global leader in immersive tech ...,https://d92mrp7hetgfk.cloudfront.net/images/si...
1,springboard,11035,www.springboard.com/?utm_source=switchup&utm_m...,Springboard is an online learning platform tha...,https://d92mrp7hetgfk.cloudfront.net/images/si...
2,udacity,11118,www.udacity.com/?utm_source=switchup&utm_mediu...,Udacity is the trusted market leader in talent...,https://d92mrp7hetgfk.cloudfront.net/images/si...
3,shecodes,11014,shecodes.io,SheCodes is a coding school that offers online...,https://d92mrp7hetgfk.cloudfront.net/images/si...
4,ironhack,10828,www.ironhack.com/en,Ironhack is a global tech school with 9 campus...,https://d92mrp7hetgfk.cloudfront.net/images/si...
...,...,...,...,...,...
62,fullstack-academy,10751,www.fullstackacademy.com,"Founded in 2012, Fullstack Academy is one of t...",https://d92mrp7hetgfk.cloudfront.net/images/si...
63,claim-academy,10589,claimacademystl.com,Claim Academy is a premier developer boot camp...,https://d92mrp7hetgfk.cloudfront.net/images/si...
64,rocket-academy,11483,rocketacademy.co/,Rocket Academy is a 6-month live and online co...,https://d92mrp7hetgfk.cloudfront.net/images/si...
65,skillcrush,11020,skillcrush.com,"Skillcrush has more than 17,000 students in al...",https://d92mrp7hetgfk.cloudfront.net/images/si...
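
Once tables like these are loaded into a database, a business question such as "which schools have the best average rating, and how many reviews is that based on?" becomes a short aggregate query. The sketch below runs one against an in-memory SQLite database. The `reviews` table layout and the scores are made up for illustration and are not results from the scraped data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical reviews table with only the columns the query needs.
conn.execute(
    "CREATE TABLE reviews (id INTEGER PRIMARY KEY, school TEXT, overall REAL)"
)
# Made-up scores, only to make the query runnable.
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [(1, "ironhack", 5.0), (2, "ironhack", 4.0), (3, "le-wagon", 5.0)],
)

query = """
SELECT school,
       COUNT(*) AS n_reviews,
       ROUND(AVG(overall), 2) AS avg_overall
FROM reviews
GROUP BY school
ORDER BY avg_overall DESC, school
"""
ranking = conn.execute(query).fetchall()
```

Reporting the review count next to the average helps when comparing a school with a handful of reviews against one with thousands.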
