# SQL Project
You were hired by Ironhack to carry out an analytics consulting project entitled "Competitive Landscape".

Your mission is to create and populate a database covering the coding schools that compete with us, and to design suitable queries that answer business questions of interest (to be defined by you).


**Suggested Steps in the Project:**


*   Read this notebook and understand each function. Comment the code appropriately

*   Populate the list of schools with a wider variety of schools (how are you going to get the school ID?)

* Take a look at the resulting dataframes. What dimensions do you have? What keys do you have? How could the different dataframes be connected?

* Go back to the drawing board and try to create an entity relationship diagram for the available tables

* Once you have the schemas you want, you will need to:
  - create the suitable SQL queries to create the tables and populate them
  - run these queries using the appropriate Python connectors
  
* Bonus: How will this data model be updated in the future? Write auxiliary functions that test the database for data quality issues. For example: how could you make sure you only include the most recent comments when you re-run the script?
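
One possible way to approach the bonus question is to key the comments table on the review `id` and let the database skip rows it already holds. The sketch below is only an illustration: it uses the standard-library `sqlite3` module instead of MySQL so it runs anywhere, the table layout and the helper name `insert_new_comments` are hypothetical, and the rows are made-up sample data. In MySQL the equivalent statement would be `INSERT IGNORE`.

```python
import sqlite3


def insert_new_comments(conn, rows):
    """Insert (id, school_id, overall_score, created_at) rows.

    Rows whose id is already stored are skipped. Returns how many rows were added.
    """
    before = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO comments (id, school_id, overall_score, created_at) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, school_id INTEGER NOT NULL, "
    "overall_score REAL, created_at TEXT)"
)

# Made-up sample rows: the second run repeats the first and adds one new review.
first_run = [(1, 10828, 5.0, "2021-06-01"), (2, 10828, 4.0, "2021-06-02")]
second_run = first_run + [(3, 10525, 4.7, "2021-06-08")]

added_first = insert_new_comments(conn, first_run)
added_second = insert_new_comments(conn, second_run)
print(added_first, added_second)
```

Re-running the scraper then only adds reviews that were not stored before, instead of replacing the whole table.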


# Suggested Deliverables

* 5-6 minute presentation of data model created, decision process and business analysis proposed

* exported .sql file with the final schema

* Supporting Python files used to generate all logic

* High-level documentation explaining the tables designed, with a focus on update methods

Crucial hint: check out the following tutorial:
https://www.dataquest.io/blog/sql-insert-tutorial/
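
As a rough illustration of the "create the tables and populate them" step, the sketch below defines two related tables and fills them with parameterized `INSERT` statements. It is an assumption-laden example rather than the expected solution: it uses the standard-library `sqlite3` module in place of MySQL, and the `schools`/`courses` layout is hypothetical. The school ids and course names are taken from this notebook.

```python
import sqlite3

# Hypothetical two-table schema: one school has many courses.
SCHEMA = """
CREATE TABLE schools (
    school_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE courses (
    course_id INTEGER PRIMARY KEY AUTOINCREMENT,
    school_id INTEGER NOT NULL REFERENCES schools(school_id),
    name      TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript(SCHEMA)

schools = {"ironhack": 10828, "la-capsule": 10853}

# Placeholders (?) keep the values out of the SQL string itself.
conn.executemany(
    "INSERT INTO schools (school_id, name) VALUES (?, ?)",
    [(school_id, name) for name, school_id in schools.items()],
)
conn.executemany(
    "INSERT INTO courses (school_id, name) VALUES (?, ?)",
    [
        (10828, "Data Analytics Bootcamp"),
        (10853, "Fullstack JavaScript Web Developer"),
    ],
)
conn.commit()

rows = conn.execute(
    "SELECT s.name, c.name FROM schools AS s "
    "JOIN courses AS c ON c.school_id = s.school_id ORDER BY s.name"
).fetchall()
print(rows)
```

With a MySQL connector the placeholder style is usually `%s` rather than `?`, but the pattern of passing values separately from the statement is the same.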


In [26]:
# you must populate this dict with the schools required -> try talking to the teaching team about this


schools = {   
'ironhack' : 10828,
'la-capsule' : 10853,
'app-academy' : 10525,
'springboard' : 11035,
'metis' : 10886,
'practicum-by-yandex' : 11225,
'le-wagon' : 10868,
'academia-de-codigo' :10494 ,
'react-graphql-academy' : 10972
}

import re
import pandas as pd
import requests



def get_comments_school(school):
    """Download the SwitchUp reviews for one school and return them as a dataframe."""
    # matches any HTML tag, e.g. <p> or <span class="truncatable">
    TAG_RE = re.compile(r'<[^>]+>')
    # review-list endpoint; the school slug is spliced into the path so the same call works for any competitor
    url = "https://www.switchup.org/chimera/v1/school-review-list?mainTemplate=school-review-list&path=%2Fbootcamps%2F" + school + "&isDataTarget=false&page=3&perPage=10000&simpleHtml=true&truncationLength=250"
    # make the GET request and parse the JSON response
    data = requests.get(url).json()
    # the reviews live under content -> reviews; convert them to a dataframe
    reviews =  pd.DataFrame(data['content']['reviews'])
  
    # aux function to apply the regex and remove tags
    def remove_tags(x):
        return TAG_RE.sub('',x)
    # review_body is the tag-free version of body; school records which slug the rows came from
    reviews['review_body'] = reviews['body'].apply(remove_tags)
    reviews['school'] = school
    return reviews
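
To see what the tag-removal regex does in isolation, here is a small standalone sketch. The sample string is made up (modeled on the `body` values shown in the output further down), and the extra `html.unescape` step is an assumption added for illustration; the notebook function above does not decode entities.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove HTML tags, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub("", text))


# Made-up review body in the same shape as the API's `body` field.
sample = '<span class="truncatable"><p>Pros: <br>- Great staff &amp; curriculum</p></span>'
cleaned = strip_tags(sample)
print(cleaned)
```

The pattern `<[^>]+>` matches anything between a `<` and the next `>`, so it is fine for this kind of simple markup but is not a general HTML parser.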

In [27]:
# could you write this as a list comprehension? ;)
# comments = []

# for school in schools.keys():
#    print(school)
#    comments.append(get_comments_school(school))
    
comments = [get_comments_school(key) for key in schools.keys()]

comments = pd.concat(comments)
comments.sample(50)

Unnamed: 0,id,name,anonymous,hostProgramName,graduatingYear,isAlumni,jobTitle,tagline,body,rawBody,...,queryDate,program,user,overallScore,comments,overall,curriculum,jobSupport,review_body,school
476,246332,Axel Dahlin,False,Software Engineering,2018.0,True,,Amazing coding journey,"<span class=""truncatable""><p>The Ironhack boot...",The Ironhack bootcamp transformed me and my cl...,...,2019-01-07,Full-time Web Development Bootcamp,{'image': None},5.0,[],5.0,5.0,5.0,The Ironhack bootcamp transformed me and my cl...,ironhack
125,266749,Andrew Chen,False,Software Engineering,2020.0,False,,"A rigorous course but, ultimately, a great one.","<span class=""truncatable""><p>App Academy's cur...",App Academy's curriculum is well designed and ...,...,2020-07-24,Software Engineer Track: In-Person,{'image': None},4.7,[],5.0,4.0,5.0,App Academy's curriculum is well designed and ...,app-academy
951,233391,Anonymous,False,,2016.0,True,,Worth What You Put In,"<span class=""truncatable""><p>App Academy is di...",App Academy is difficult. Extremely difficult....,...,2017-01-23,Software Engineer Track: In-Person,{'image': None},5.0,[],5.0,5.0,5.0,App Academy is difficult. Extremely difficult....,app-academy
726,243120,Jeff,False,Software Engineering,2018.0,True,,Great experience,"<p>Everything is made to succeed, surrounded b...","Everything is made to succeed, surrounded by a...",...,2018-03-16,Full-time Web Development Bootcamp,{'image': None},5.0,[],5.0,5.0,5.0,"Everything is made to succeed, surrounded by a...",ironhack
754,245839,Onyema Nwokolo,False,UX/UI Design,2018.0,True,,Springboard UX Design Course,"<span class=""truncatable""><p>I really enjoyed ...",I really enjoyed the way Springboard structure...,...,2018-11-30,UX Design,{'image': None},3.7,[],4.0,3.0,4.0,I really enjoyed the way Springboard structure...,springboard
1,278202,Kora Feder,False,,2021.0,False,,I learned a lot,"<span class=""truncatable""><p></p><p>I just gra...",<p>I just graduated. The best part of the prog...,...,2021-06-08,UI/UX Design Career Track,{'image': None},4.7,[],5.0,5.0,4.0,I just graduated. The best part of the program...,springboard
16,274707,Ahmed Eldemerdash,False,,2021.0,True,,Above and Beyond,"<span class=""truncatable""><p></p><p>very great...",<p>very great course in web development .. goo...,...,2021-03-13,Web Developer,{'image': None},5.0,[],5.0,5.0,5.0,very great course in web development .. good r...,practicum-by-yandex
272,261637,Anonymous,True,Software Engineering,2020.0,True,,Awesome Learning Experience,"<span class=""truncatable""><p>Pros: <br>- You r...",Pros: \r\n- You really are able to create a fu...,...,2020-02-04,Software Engineer Track: In-Person,{'image': None},5.0,[],5.0,5.0,5.0,Pros: - You really are able to create a fully ...,app-academy
267,261644,Anonymous,True,Software Engineering,2020.0,True,,Overall pretty good experience,"<span class=""truncatable""><p>Before attending,...","Before attending, I had read reviews online th...",...,2020-02-04,Software Engineer Track: In-Person,{'image': None},4.0,[],4.0,4.0,4.0,"Before attending, I had read reviews online th...",app-academy
1129,246859,Richard van den Broek,False,Software Engineering,2018.0,True,Web developer,"Effective learning method, great overall expri...","<span class=""truncatable""><p>I started my care...",I started my career in the shipping industry (...,...,2019-02-16,FullStack program - 35+ locations,{'image': None},4.7,[],5.0,5.0,4.0,I started my career in the shipping industry (...,le-wagon


In [5]:
def get_school_info(school, school_id):
    """Return (locations, courses, badges, school) dataframes for one school."""
    url = 'https://www.switchup.org/chimera/v1/bootcamp-data?mainTemplate=bootcamp-data%2Fdescription&path=%2Fbootcamps%2F'+ str(school) + '&isDataTarget=false&bootcampId='+ str(school_id) + '&logoTag=logo&truncationLength=250&readMoreOmission=...&readMoreText=Read%20More&readLessText=Read%20Less'

    data = requests.get(url).json()
    content = data['content']

    # one row per course name
    courses_df = pd.DataFrame(content['courses'], columns=['courses'])

    # locations are nested dicts (country, city, state); flatten them into dotted columns
    # pd.json_normalize replaces pandas.io.json.json_normalize, deprecated since pandas 1.0
    locations_df = pd.json_normalize(content['locations'])

    badges_df = pd.DataFrame(content['meritBadges'])

    # single-row dataframe with the school-level attributes
    school_df = pd.DataFrame([[content['webaddr'], content['description'], content['logoUrl']]],
                             columns=['website', 'description', 'LogoUrl'])

    # tag every dataframe with the school slug and id so they can be joined later
    for df in (locations_df, courses_df, badges_df, school_df):
        df['school'] = school
        df['school_id'] = school_id

    return locations_df, courses_df, badges_df, school_df

locations_list = []
courses_list = []
badges_list = []
schools_list = []

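# school_id instead of id, so the built-in id() is not shadowed
for school, school_id in schools.items():
    print(school)
    locations_df, courses_df, badges_df, school_df = get_school_info(school, school_id)
    
    locations_list.append(locations_df)
    courses_list.append(courses_df)
    badges_list.append(badges_df)
    schools_list.append(school_df)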



ironhack




la-capsule
app-academy
springboard
metis
practicum-by-yandex
le-wagon
academia-de-codigo
react-graphql-academy


In [6]:
locations_list

[      id               description  country.id   country.name country.abbrev  \
 0  15901           Berlin, Germany        57.0        Germany             DE   
 1  16022       Mexico City, Mexico        29.0         Mexico             MX   
 2  16086    Amsterdam, Netherlands        59.0    Netherlands             NL   
 3  16088         Sao Paulo, Brazil        42.0         Brazil             BR   
 4  16109             Paris, France        38.0         France             FR   
 5  16375  Miami, FL, United States         1.0  United States             US   
 6  16376             Madrid, Spain        12.0          Spain             ES   
 7  16377          Barcelona, Spain        12.0          Spain             ES   
 8  16709          Lisbon, Portugal        28.0       Portugal             PT   
 9  17233                    Online         NaN            NaN            NaN   
 
    city.id    city.name city.keyword  state.id state.name state.abbrev  \
 0  31156.0       Berlin       b

In [7]:
locations = pd.concat(locations_list)

In [8]:
courses = pd.concat(courses_list)
courses.head(10)

Unnamed: 0,courses,school,school_id
0,Cyber Security Bootcamp,ironhack,10828
1,Cybersecurity Part-Time,ironhack,10828
2,Data Analytics Bootcamp,ironhack,10828
3,Data Analytics Part-Time,ironhack,10828
4,UX/UI Design Bootcamp,ironhack,10828
5,UX/UI Design Part-Time,ironhack,10828
6,Web Development Bootcamp,ironhack,10828
7,Web Development Part-Time,ironhack,10828
0,Fullstack JavaScript Web Developer,la-capsule,10853
0,Bootcamp Prep,app-academy,10525


In [24]:
badges_raw = pd.concat(badges_list)
badges_raw = badges_raw.drop_duplicates(subset=['name'])
badges_raw


Unnamed: 0,name,keyword,description,school,school_id
0,Available Online,available_online,<p>School offers fully online courses</p>,ironhack,10828
1,Verified Outcomes,verified_outcomes,<p>School publishes a third-party verified out...,ironhack,10828
2,Flexible Classes,flexible_classes,<p>School offers part-time and evening classes...,ironhack,10828
2,Job Guarantee,job_guarantee,<p>School guarantees job placement</p>,app-academy,10525


In [10]:
# map each badge name to a fixed integer id (unknown names map to None)
BADGE_IDS = {
    'Available Online': 1,
    'Verified Outcomes': 2,
    'Flexible Classes': 3,
    'Job Guarantee': 4,
}

def badges_m(row):
    return BADGE_IDS.get(row)

In [11]:
badges_raw.insert(0,'badges_id',(badges_raw['name'].apply(badges_m)))

In [12]:
badges = badges_raw

In [13]:
badges

Unnamed: 0,badges_id,name,keyword,description,school,school_id
0,1,Available Online,available_online,<p>School offers fully online courses</p>,ironhack,10828
1,2,Verified Outcomes,verified_outcomes,<p>School publishes a third-party verified out...,ironhack,10828
2,3,Flexible Classes,flexible_classes,<p>School offers part-time and evening classes...,ironhack,10828
2,4,Job Guarantee,job_guarantee,<p>School guarantees job placement</p>,app-academy,10525


In [14]:
badges.columns = ['badges_id','name','keyword','description','school','school_id']

In [15]:
badges

Unnamed: 0,badges_id,name,keyword,description,school,school_id
0,1,Available Online,available_online,<p>School offers fully online courses</p>,ironhack,10828
1,2,Verified Outcomes,verified_outcomes,<p>School publishes a third-party verified out...,ironhack,10828
2,3,Flexible Classes,flexible_classes,<p>School offers part-time and evening classes...,ironhack,10828
2,4,Job Guarantee,job_guarantee,<p>School guarantees job placement</p>,app-academy,10525


In [16]:
clean_badges = badges[['badges_id','name']]

In [17]:
clean_badges = clean_badges.drop_duplicates(subset=['badges_id'])

In [18]:
display(clean_badges)

Unnamed: 0,badges_id,name
0,1,Available Online
1,2,Verified Outcomes
2,3,Flexible Classes
2,4,Job Guarantee


In [19]:
clean_schools = badges[['school_id','school']]
clean_schools.columns = ['schools_id','name']
clean_schools = clean_schools.drop_duplicates(subset=['schools_id'])


In [20]:
clean_schools

Unnamed: 0,schools_id,name
0,10828,ironhack
2,10525,app-academy


In [21]:
schools

{'ironhack': 10828,
 'la-capsule': 10853,
 'app-academy': 10525,
 'springboard': 11035,
 'metis': 10886,
 'practicum-by-yandex': 11225,
 'le-wagon': 10868,
 'academia-de-codigo': 10494,
 'react-graphql-academy': 10972}

In [22]:
clean_badges_schools = badges.drop_duplicates(subset=['name','badges_id'])
clean_badges_schools = clean_badges_schools[['badges_id','school_id']]
# number the rows 1..n instead of hardcoding the row count
clean_badges_schools.insert(0,'school_badges_id',(range(1, len(clean_badges_schools) + 1)))
clean_badges_schools

Unnamed: 0,school_badges_id,badges_id,school_id
0,1,1,10828
1,2,2,10828
2,3,3,10828
2,4,4,10525
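
Note that `badges` was deduplicated on `name` earlier, so each badge appears to keep only the first school it was seen with. That would explain why the link table above has four rows and `clean_schools` contains only two schools. A link (junction) table normally keeps every distinct school-badge pair. The sketch below shows the idea in plain Python so it runs without pandas. The pair marked as illustrative is an assumption, since the dropped rows are not visible in the output above.

```python
# (school_id, badge_name) pairs as they would come out of the raw badges data.
pairs = [
    (10828, "Available Online"),
    (10828, "Verified Outcomes"),
    (10828, "Flexible Classes"),
    (10525, "Available Online"),  # illustrative: a badge shared by two schools
    (10525, "Job Guarantee"),
]

# Dimension table: one id per distinct badge name, in order of first appearance.
badge_ids = {}
for _, name in pairs:
    badge_ids.setdefault(name, len(badge_ids) + 1)

# Junction table: every distinct (school_id, badge_id) pair is kept.
school_badges = sorted({(school_id, badge_ids[name]) for school_id, name in pairs})

print(badge_ids)
print(school_badges)
```

In pandas the same effect could be had by deduplicating on both `school_id` and `name` when building the link table, and on `name` alone only when building the badge dimension.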


In [23]:
# `schools` is a plain dict and cannot be merged; use the per-school dataframe instead.
# It also supplies school_id and the website/description/LogoUrl columns dropped below.
schools_df = pd.concat(schools_list)
clean_comments = comments.merge(schools_df, how='inner', on='school')
to_drop = ['graduatingYear','tagline','body','rawBody','user','comments','review_body','anonymous','createdAt','jobTitle','hostProgramName','queryDate']
clean_comments.drop(to_drop, inplace=True, axis=1)
clean_comments.drop(['website','description','LogoUrl','school'], inplace=True, axis=1)
clean_comments = clean_comments.fillna(0)
# make sure the four score columns are numeric
score_cols = ['overall','overallScore','curriculum','jobSupport']
clean_comments[score_cols] = clean_comments[score_cols].astype(float)

clean_comments

In [None]:
locations_clean = locations.copy()
to_drop = ['description','country.id','country.abbrev','city.id','city.keyword','state.id','state.name','state.abbrev','state.keyword']
locations_clean.drop(to_drop,inplace=True,axis=1,)
locations_clean.rename(columns = {'id':'location_id','country.name':'country','city.name':'city'}, inplace = True)

In [None]:
clean_locations = locations_clean
clean_locations = clean_locations.fillna('Online')

In [None]:
clean_locations

In [None]:
display(clean_schools)
display(clean_badges)
display(clean_badges_schools)
display(clean_comments)
display(clean_locations)

In [196]:
################# CREATING A CONNECTION TO THE DATABASE #####################
import getpass
from sqlalchemy import create_engine

# Connection settings

host = "database-1.cesj3b2ko52z.us-east-2.rds.amazonaws.com"
port = 3306
dbname = "project"
user = "root"
# prompt for the password instead of hardcoding it in the notebook
password = getpass.getpass("Database password: ")

# The mysqlconnector dialect needs the mysql-connector-python package installed.
# Note that `engine` is really an open Connection, which the to_sql / read_sql_query calls below use.
engine = create_engine('mysql+mysqlconnector://{0}:{1}@{2}:{3}/{4}'
            .format(user, password, host, port, dbname)).connect()


# create cursor


In [197]:
clean_comments.to_sql('comments', con = engine,if_exists = 'replace')

In [200]:
clean_schools.to_sql('schools', con = engine,if_exists = 'replace')

In [201]:
clean_badges.to_sql('badges', con = engine,if_exists = 'replace')

In [202]:
clean_badges_schools.to_sql('school_badges_id', con = engine,if_exists = 'replace')

In [204]:
clean_locations.to_sql('locations', con = engine,if_exists = 'replace')
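
Once the tables are loaded, a simple data quality check is to look for child rows whose key has no match in the parent table. The sketch below is one way this might look. The helper `find_orphans` is hypothetical and uses the standard-library `sqlite3` module so it can run without the MySQL instance. The ids come from the outputs in this notebook, where `clean_schools` holds only two schools while the comments refer to others such as 11035.

```python
import sqlite3


def find_orphans(conn, child, child_key, parent, parent_key):
    """Return child key values that have no matching row in the parent table.

    Table and column names are interpolated into the SQL, so only pass
    trusted identifiers, never user input.
    """
    sql = (
        f"SELECT DISTINCT c.{child_key} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{child_key} = p.{parent_key} "
        f"WHERE p.{parent_key} IS NULL ORDER BY c.{child_key}"
    )
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (schools_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, school_id INTEGER)")
conn.executemany(
    "INSERT INTO schools VALUES (?, ?)",
    [(10828, "ironhack"), (10525, "app-academy")],
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [(276568, 10828), (234894, 11035)],
)

orphans = find_orphans(conn, "comments", "school_id", "schools", "schools_id")
print(orphans)
```

A non-empty result means the parent table is missing rows, which here would point back to how `clean_schools` was derived from the deduplicated badges.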

In [198]:
query = pd.read_sql_query("""SELECT * FROM comments;""", engine)

In [199]:
query

Unnamed: 0,index,id,name,isAlumni,program,overallScore,overall,curriculum,jobSupport,school_id
0,0,276568,Guilherme golabek brein,0,Web Development Part-Time,1.0,1.0,1.0,1.0,10828
1,1,276147,Charlotte Urvoy,0,UX/UI Design Bootcamp,5.0,5.0,5.0,5.0,10828
2,2,275972,Anonymous,0,UX/UI Design Bootcamp,4.0,5.0,4.0,3.0,10828
3,3,275872,Ahmad Khalaf,0,UX/UI Design Bootcamp,4.0,4.0,4.0,4.0,10828
4,4,275855,Morgane Favchtein,0,UX/UI Design Bootcamp,4.3,5.0,4.0,4.0,10828
...,...,...,...,...,...,...,...,...,...,...
3017,3017,234894,Stephanie S.,1,UX Design,5.0,5.0,5.0,5.0,11035
3018,3018,234877,Joe Fang,1,UX Design,5.0,5.0,5.0,5.0,11035
3019,3019,234838,Zeina,1,UX Design,3.7,4.0,4.0,3.0,11035
3020,3020,242681,Jean,1,UX Design,5.0,5.0,5.0,5.0,11035
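
With the comments in a table, a first business question could be "which schools are rated highest, and by how many reviewers?". The sketch below shows one way to phrase that as SQL. It runs against an in-memory `sqlite3` table loaded with four rows copied from the query output above, so the averages it prints describe only that tiny subset, not the real data. The reduced three-column table is an assumption made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, school_id INTEGER, overallScore REAL)"
)
# Four rows taken from the query output above (id, school_id, overallScore).
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (276568, 10828, 1.0),
        (276147, 10828, 5.0),
        (234894, 11035, 5.0),
        (234838, 11035, 3.7),
    ],
)

query = """
SELECT school_id,
       COUNT(*)                    AS n_reviews,
       ROUND(AVG(overallScore), 2) AS avg_score
FROM comments
GROUP BY school_id
ORDER BY avg_score DESC
"""
result = conn.execute(query).fetchall()
for school_id, n_reviews, avg_score in result:
    print(school_id, n_reviews, avg_score)
```

The same statement should work through `pd.read_sql_query(query, engine)` against the MySQL database, and joining to the schools table would replace the ids with names.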
