# National University Ranking

![](2023-07-13-00-48-01.png)

ROUGH OUTLINE
1. extract the state short code from location
2. extract the year founded from description
3. clean up the fees, in-state and enrollment columns
4. clean up the column names

* 51 state codes (50 states plus DC)

5. separate the schools by state
6. each state will be a table of its own in the database for database optimization
7. create a state table
8. rank schools within each state based on enrollment

-- EDA

9. Top schools based on rank, tuition fees, in-state fees, and enrollment
10. Top 2 schools within each of the 51 states
11. oldest schools based on year founded
12. oldest school within each of the 51 states

-- App interface

13. A brief overview with visuals from top school based on rank (overall)
14. selection box to select state. Once selected, brief overview with a visual of the top school in that state.
* Overview will include details based on the available data like average tuition and top ranked school.
15. Another section where we take input from the user to recommend a school within a chosen state.
16. If no state is chosen, we recommend based on the user's location and the closest state.
* For the closest state, we can do some research to find out which states are closest to each other (feature engineering).
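
The recommendation step in items 15 and 16 could be sketched roughly as follows. This is only an illustration under assumptions: the `State` column is the engineered state code from step 1, `NEIGHBORS` is a hypothetical and deliberately partial adjacency map, and `recommend` is a made-up helper name. Detecting the user's location is left out entirely.

```python
import pandas as pd

# Hypothetical, partial adjacency map; a real one would cover every state code.
NEIGHBORS = {
    "NJ": ["NY", "PA", "DE"],
    "CT": ["NY", "MA", "RI"],
}

def recommend(df, state, max_tuition, top_n=3):
    """Return up to top_n best-ranked schools within budget, widening to neighbors if needed."""
    affordable = df[df["Tuition and fees"] <= max_tuition]
    picks = affordable[affordable["State"] == state]
    if picks.empty:
        # Nothing affordable in the chosen state: fall back to bordering states.
        picks = affordable[affordable["State"].isin(NEIGHBORS.get(state, []))]
    return picks.sort_values("Rank").head(top_n)

# A few rows taken from the dataset samples shown in this notebook.
schools = pd.DataFrame({
    "Name": ["Princeton University", "Villanova University", "Stony Brook University--SUNY"],
    "State": ["NJ", "PA", "NY"],
    "Rank": [1, 50, 96],
    "Tuition and fees": [45320, 49280, 26266],
})

result = recommend(schools, state="NJ", max_tuition=30000)
print(result["Name"].tolist())
```

With a budget of 30,000 no New Jersey school qualifies, so the sketch falls back to the neighboring states and returns Stony Brook.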

## Objectives
Outline breakdown

### Data Cleaning: 
* Objective: Clean the data and feature engineer new columns
* Tools: Python
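
As one example of a feature-engineered column, the year founded (step 2 of the outline) might be pulled from `Description` with a regular expression. This is a minimal sketch that assumes the year follows the phrase "Founded in", as it does in some of the sample rows; descriptions worded differently would need extra patterns and come back as missing here.

```python
import pandas as pd

# Two truncated descriptions as they appear in the notebook output.
descriptions = pd.Series([
    "Founded in 1878, Duquesne University is a priv...",
    "Princeton, the fourth-oldest college in the Un...",
])

# Capture the four digits after "Founded in" / "founded in".
year_text = descriptions.str.extract(r"[Ff]ounded in (\d{4})", expand=False)

# Nullable integer dtype keeps rows without a match as <NA> instead of 0.
year_founded = pd.to_numeric(year_text).astype("Int64")
print(year_founded)
```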

### EDA
* Objective: Create visuals from the cleaned data
* Tools: Power BI, Python
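
Before plotting, the "top 2 schools per state" figure from the outline could be computed along these lines. The `State` column is assumed to come from the cleaning step; the rows are taken from the samples shown in this notebook.

```python
import pandas as pd

schools = pd.DataFrame({
    "Name": [
        "Ohio State University--Columbus",
        "University of Cincinnati",
        "Ohio University",
        "Villanova University",
        "Duquesne University",
    ],
    "State": ["OH", "OH", "OH", "PA", "PA"],
    "Rank": [54, 135, 146, 50, 124],
})

# Sort by national rank, then keep the first two rows of every state group.
top2 = (
    schools.sort_values("Rank")
    .groupby("State")
    .head(2)
    .sort_values(["State", "Rank"])
)
print(top2)
```

The resulting frame can be passed directly to a `plotly.express` bar chart or exported for Power BI.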

### Database Engineers
* Objective: Create the database and the schema. Views if possible
* Tools: SQL
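
A possible starting point for the schema, sketched with Python's built-in `sqlite3` so it runs without a database server. Note that this sketch uses a single `schools` table with a `state_code` foreign key rather than the one-table-per-state layout from the outline. All table, column, and view names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE states (
    code TEXT PRIMARY KEY
);
CREATE TABLE schools (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    state_code TEXT NOT NULL REFERENCES states(code),
    national_rank INTEGER,
    enrollment INTEGER
);
-- One summary row per state, usable by the app's overview section.
CREATE VIEW state_summary AS
    SELECT state_code,
           COUNT(*) AS school_count,
           MAX(enrollment) AS largest_enrollment
    FROM schools
    GROUP BY state_code;
""")

conn.executemany("INSERT INTO states (code) VALUES (?)", [("OH",), ("CA",)])
conn.executemany(
    "INSERT INTO schools (name, state_code, national_rank, enrollment) VALUES (?, ?, ?, ?)",
    [
        ("Ohio State University--Columbus", "OH", 54, 45289),
        ("University of Cincinnati", "OH", 135, 25054),
        ("San Diego State University", "CA", 146, 29234),
    ],
)

# Schools inside one state, largest enrollment first.
oh_rows = conn.execute(
    "SELECT name FROM schools WHERE state_code = ? ORDER BY enrollment DESC",
    ("OH",),
).fetchall()
summary = conn.execute(
    "SELECT * FROM state_summary WHERE state_code = ?", ("OH",)
).fetchone()
print(oh_rows, summary)
```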

### Project App
* Objective: Create the user interface
* Tools: Python, SQL, User

### Technical Writers
* Objective: Document every process
* Tools: Microsoft/Google suites (Slides, Docs, Excel, etc.)

## Have a video meeting to discuss milestone achievement

### Developing a searchable database to help high school students identify colleges that match their criteria in terms of tuition, graduation rate, location, and rank.

### Import Libraries

In [2]:
import pandas as pd
import plotly.express as px


In [5]:
# Loading the dataset
nur = pd.read_csv("dataset/National Universities Rankings.csv", index_col=0)
nur.shape

(231, 7)

In [4]:
nur.head()

| index | Name | Location | Rank | Description | Tuition and fees | In-state | Undergrad Enrollment |
|---|---|---|---|---|---|---|---|
| 0 | Princeton University | Princeton, NJ | 1 | Princeton, the fourth-oldest college in the Un... | $45,320 | NaN | 5402 |
| 125 | Duquesne University | Pittsburgh, PA | 124 | Founded in 1878, Duquesne University is a priv... | $35,062 | NaN | 5961 |
| 2 | University of Chicago | Chicago, IL | 3 | The University of Chicago, situated in Chicago... | $52,491 | NaN | 5844 |
| 3 | Yale University | New Haven, CT | 3 | Yale University, located in New Haven, Connect... | $49,480 | NaN | 5532 |
| 4 | Columbia University | New York, NY | 5 | Columbia University, located in Manhattan's Mo... | $55,056 | NaN | 6102 |


In [7]:
df_nur = nur.copy()

### 1. Cleaning the tuition and fees column

In [None]:
# Remove the dollar sign ($) and comma
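# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
df_nur['Tuition and fees'] = df_nur['Tuition and fees'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)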

# convert the column datatype to integer
df_nur['Tuition and fees'] = df_nur['Tuition and fees'].fillna('0').astype(int)



In [None]:
df_nur['Tuition and fees'].sample(10)

index
146    21208
95     51030
114    21451
117    30968
211    27028
64     46994
182    27684
3      49480
144    25673
79     40241
Name: Tuition and fees, dtype: int32
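
The same cleanup applies to the `In-state` column, which has missing values. The helper below is a small sketch, not the notebook's own method: `clean_currency` is a hypothetical name, and it keeps missing fees as `<NA>` in a nullable integer column instead of filling them with 0, so that averages are not dragged down by placeholder zeros.

```python
import pandas as pd

def clean_currency(series):
    """Strip '$' and ',' literally, then convert to a nullable integer column."""
    stripped = (
        series.str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    return pd.to_numeric(stripped).astype("Int64")

# In-state values from three rows of the sample output (one is missing).
in_state = pd.Series(["$10,158", None, "$9,026"])
cleaned = clean_currency(in_state)
print(cleaned)
```

Usage on the real frame would look like `df_nur['In-state'] = clean_currency(df_nur['In-state'])`.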

In [None]:
df_nur.sample(10)

| index | Name | Location | Rank | Description | Tuition and fees | In-state | Undergrad Enrollment |
|---|---|---|---|---|---|---|---|
| 51 | Villanova University | Villanova, PA | 50 | Villanova University, named for the Spanish Au... | 49280 | NaN | 6994 |
| 128 | Arizona State University--Tempe | Tempe, AZ | 129 | Arizona State University--Tempe, which has one... | 25458 | $10,158 | 41828 |
| 219 | Benedictine University | Lisle, IL | 220 | Founded in 1887, Benedictine University is a p... | 32170 | NaN | 3347 |
| 22 | University of Southern California | Los Angeles, CA | 23 | The University of Southern California's centra... | 52217 | NaN | 18810 |
| 97 | Stony Brook University--SUNY | Stony Brook, NY | 96 | Stony Brook University is one of 64 schools in... | 26266 | $9,026 | 16831 |
| 146 | Ohio University | Athens, OH | 146 | Freshmen at Ohio University (OU) in Athens can... | 21208 | $11,744 | 23513 |
| 49 | Pepperdine University | Malibu, CA | 50 | Squeezed in among the Santa Monica Mountain fo... | 50022 | NaN | 3533 |
| 37 | University of California--Santa Barbara | Santa Barbara, CA | 37 | Located 100 miles up the coast from Los Angele... | 40704 | $14,022 | 20607 |
| 57 | University of Georgia | Athens, GA | 56 | At its founding, The University of Georgia mad... | 29844 | $11,634 | 27547 |
| 55 | George Washington University | Washington, DC | 56 | George Washington University's urban location ... | 51950 | NaN | 11157 |


### 2

### 3

In [4]:
nur['Location'].apply(lambda x: x.split(',')[1]).unique()

array([' NJ', ' MA', ' IL', ' CT', ' NY', ' CA', ' NC', ' PA', ' MD',
       ' NH', ' RI', ' TX', ' IN', ' TN', ' MO', ' GA', ' DC', ' VA',
       ' MI', ' OH', ' LA', ' FL', ' WI', ' WA', ' SC', ' UT', ' MN',
       ' DE', ' CO', ' IA', ' OK', ' VT', ' AL', ' OR', ' NE', ' KS',
       ' AZ', ' KY', ' AR', ' MS', ' HI', ' ID', ' WY', ' NM', ' ME',
       ' WV', ' ND', ' NV', ' SD', ' AK', ' MT'], dtype=object)

In [5]:
len(nur['Location'].apply(lambda x: x.split(',')[1]).unique())

51
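
The codes above still carry a leading space (`' NJ'`). A sketch of how a clean `State` column might be built, assuming every `Location` value ends with ", XX" as in the output above:

```python
import pandas as pd

locations = pd.Series(["Princeton, NJ", "New Haven, CT", "Washington, DC"])

# Split once from the right so city names containing commas stay intact,
# take the last piece, and strip the surrounding whitespace.
state = locations.str.rsplit(",", n=1).str[-1].str.strip()
print(state.tolist())  # ['NJ', 'CT', 'DC']
```

On the real frame this would be `df_nur['State'] = df_nur['Location'].str.rsplit(',', n=1).str[-1].str.strip()`.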

In [7]:
df_nur.dtypes

Name                    object
Location                object
Rank                     int64
Description             object
Tuition and fees        object
In-state                object
Undergrad Enrollment    object
dtype: object

In [8]:
df_nur.isna().sum()

Name                     0
Location                 0
Rank                     0
Description              0
Tuition and fees         0
In-state                98
Undergrad Enrollment     0
dtype: int64

### Cleaning the undergraduate enrollment column

In [41]:
# create a copy of dataframe
df_nur = nur.copy()

In [42]:
# access dataframe
df_nur.head(1)

| index | Name | Location | Rank | Description | Tuition and fees | In-state | Undergrad Enrollment |
|---|---|---|---|---|---|---|---|
| 0 | Princeton University | Princeton, NJ | 1 | Princeton, the fourth-oldest college in the Un... | $45,320 | NaN | 5402 |


In [43]:
# access datatype
df_nur.dtypes

Name                    object
Location                object
Rank                     int64
Description             object
Tuition and fees        object
In-state                object
Undergrad Enrollment    object
dtype: object

In [44]:
# data cleaning by replacing ',' with '' and converting column to integer
df_nur['Undergrad Enrollment'] = df_nur['Undergrad Enrollment'].str.replace(',','').astype('int')

In [45]:
# check
df_nur.sample(5)

| index | Name | Location | Rank | Description | Tuition and fees | In-state | Undergrad Enrollment |
|---|---|---|---|---|---|---|---|
| 92 | North Carolina State University--Raleigh | Raleigh, NC | 92 | North Carolina State University, also known as... | $26,399 | $8,880 | 24111 |
| 53 | Ohio State University--Columbus | Columbus, OH | 54 | Located in the state capital of Columbus, The ... | $29,229 | $10,037 | 45289 |
| 140 | University of Cincinnati | Cincinnati, OH | 135 | The University of Cincinnati is a public schoo... | $26,334 | $11,000 | 25054 |
| 147 | San Diego State University | San Diego, CA | 146 | Founded in 1897, San Diego State University is... | $18,244 | $7,084 | 29234 |
| 118 | Seton Hall University | South Orange, NJ | 118 | Seton Hall University is a private, Catholic s... | $39,258 | NaN | 6090 |


In [46]:
# check
df_nur.dtypes

Name                    object
Location                object
Rank                     int64
Description             object
Tuition and fees        object
In-state                object
Undergrad Enrollment     int32
dtype: object
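
Step 4 of the outline (cleaning up the column names) is not covered above. One possible convention, shown here only as a sketch, is lowercase snake_case, which is easier to use both in Python and as SQL column names:

```python
import pandas as pd

# Empty frame with the dataset's column names, just to demonstrate the rename.
df = pd.DataFrame(columns=[
    "Name", "Location", "Rank", "Description",
    "Tuition and fees", "In-state", "Undergrad Enrollment",
])

# Lowercase everything and replace spaces and hyphens with underscores.
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
)
print(df.columns.tolist())
```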