Importing the pandas package for reading the dataset

In [24]:
import pandas as pd

Reading the data from the enrollment-by-zipcode.csv file

In [25]:
dataframe = pd.read_csv("enrollment-by-zipcode.csv")
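
Zip codes can lose information when read as numbers, so one optional variant is to read the column as text. The sketch below is a minimal illustration, assuming a tiny in-memory CSV; the sample rows and the names csv_text and df are hypothetical and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = (
    "SEMESTER,SPRADDR_ZIP_PR,ENROLLMENT\n"
    "Fall 2012,96766,73\n"
    "Fall 2012,02134-1234,5\n"
)

# dtype=str keeps zip codes as text, so leading zeros and ZIP+4 suffixes survive
df = pd.read_csv(io.StringIO(csv_text), dtype={"SPRADDR_ZIP_PR": str})

print(df["SPRADDR_ZIP_PR"].tolist())
```

Reading the column as text makes the later length check on the zip code independent of how pandas infers the column type.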

Displaying the first five rows of the DataFrame

In [26]:
dataframe.head()

Unnamed: 0,SEMESTER,SPRADDR_ZIP_PR,IRO_INSTITUTION_DESCL,HAWAIIAN_LEGACY,ENROLLMENT
0,Fall 2012,96766,Kaua`i Community College,HAWAIIAN,73
1,Fall 2012,96814,Kapi`olani Community College,,19
2,Fall 2012,96792,Leeward Community College,HAWAIIAN,504
3,Fall 2012,96822,Honolulu Community College,,12
4,Fall 2012,96821,Kapi`olani Community College,,43


In [27]:
dataframe.shape

(87798, 5)

Cleaning the data so that every zip code is in the ZIP5 format (e.g. 80246), not the
ZIP5+4 format (e.g. 80248-1234). The filter keeps only rows whose zip code is exactly
five characters long; all other rows are dropped.

In [28]:
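# Keep rows with a five-character zip code; .copy() avoids a SettingWithCopyWarning on later assignments
dataframe = dataframe[dataframe['SPRADDR_ZIP_PR'].apply(str).apply(len) == 5].copy()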

In [29]:
dataframe['SPRADDR_ZIP_PR'] = pd.to_numeric(dataframe['SPRADDR_ZIP_PR'], errors='coerce')

In [30]:
dataframe = dataframe.dropna(subset=['SPRADDR_ZIP_PR']).set_index('SPRADDR_ZIP_PR')

In [31]:
dataframe = dataframe.reset_index()

In [32]:
dataframe['SPRADDR_ZIP_PR'] = dataframe['SPRADDR_ZIP_PR'].astype(int)

In [33]:
dataframe.head()

Unnamed: 0,SPRADDR_ZIP_PR,SEMESTER,IRO_INSTITUTION_DESCL,HAWAIIAN_LEGACY,ENROLLMENT
0,96766,Fall 2012,Kaua`i Community College,HAWAIIAN,73
1,96814,Fall 2012,Kapi`olani Community College,,19
2,96792,Fall 2012,Leeward Community College,HAWAIIAN,504
3,96822,Fall 2012,Honolulu Community College,,12
4,96821,Fall 2012,Kapi`olani Community College,,43


In [34]:
dataframe.shape

(73562, 5)
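
The cleaning step above drops every row whose zip code is not exactly five characters long (the shape falls from 87798 to 73562 rows). If ZIP5+4 rows should be kept instead, one possible variant is to extract the leading five digits. The sketch below compares both approaches on toy data; the sample values and the names sample, filtered, and normalized are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real file)
sample = pd.DataFrame({
    "SPRADDR_ZIP_PR": ["96766", "96814-1234", "9682", None, "96792"],
    "ENROLLMENT": [73, 19, 12, 5, 504],
})

zips = sample["SPRADDR_ZIP_PR"].astype(str)

# Approach used above: keep only values that are exactly five characters long
filtered = sample[zips.str.len() == 5]

# Variant: capture the leading five digits of ZIP5 and ZIP5+4 values,
# so ZIP5+4 rows are kept rather than dropped; anything else becomes missing
normalized = sample.assign(
    SPRADDR_ZIP_PR=zips.str.extract(r"^(\d{5})(?:-\d{4})?$", expand=False)
).dropna(subset=["SPRADDR_ZIP_PR"])

print(len(filtered), len(normalized))
```

Which approach is appropriate depends on whether ZIP5+4 rows are meant to count toward their ZIP5 area; the notebook itself uses the filter.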

Creating a YEAR column in the dataset. The original SEMESTER column has the form
Fall 20nn, so the year is the last space-separated token.

In [36]:
dataframe['YEAR'] = dataframe.SEMESTER.str.split(" ").str[-1]

In [37]:
dataframe.head()

Unnamed: 0,SPRADDR_ZIP_PR,SEMESTER,IRO_INSTITUTION_DESCL,HAWAIIAN_LEGACY,ENROLLMENT,YEAR
0,96766,Fall 2012,Kaua`i Community College,HAWAIIAN,73,2012
1,96814,Fall 2012,Kapi`olani Community College,,19,2012
2,96792,Fall 2012,Leeward Community College,HAWAIIAN,504,2012
3,96822,Fall 2012,Honolulu Community College,,12,2012
4,96821,Fall 2012,Kapi`olani Community College,,43,2012
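
The split above leaves YEAR as text. Where a numeric year is more convenient (for sorting or comparisons), a possible variant is to capture the four digits and convert them. This is a minimal sketch on toy data; the "Spring 2014" row and the YEAR_INT column name are hypothetical additions for illustration only.

```python
import pandas as pd

# Toy data; "Spring 2014" is a made-up value used to show the pattern is not tied to "Fall"
semesters = pd.DataFrame({"SEMESTER": ["Fall 2012", "Fall 2013", "Spring 2014"]})

# Same idea as the split above: the year is the last space-separated token
semesters["YEAR"] = semesters["SEMESTER"].str.split(" ").str[-1]

# Variant: capture four consecutive digits and convert them to integers
semesters["YEAR_INT"] = pd.to_numeric(
    semesters["SEMESTER"].str.extract(r"(\d{4})", expand=False)
)

print(semesters)
```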


Dropping the HAWAIIAN_LEGACY column 

In [38]:
dataframe = dataframe.drop(['HAWAIIAN_LEGACY'],axis=1)

In [39]:
dataframe.head()

Unnamed: 0,SPRADDR_ZIP_PR,SEMESTER,IRO_INSTITUTION_DESCL,ENROLLMENT,YEAR
0,96766,Fall 2012,Kaua`i Community College,73,2012
1,96814,Fall 2012,Kapi`olani Community College,19,2012
2,96792,Fall 2012,Leeward Community College,504,2012
3,96822,Fall 2012,Honolulu Community College,12,2012
4,96821,Fall 2012,Kapi`olani Community College,43,2012


Eliminating any rows that contain NaN values

In [40]:
dataframe = dataframe.dropna()

In [44]:
dataframe.head()

Unnamed: 0,SPRADDR_ZIP_PR,SEMESTER,IRO_INSTITUTION_DESCL,ENROLLMENT,YEAR
0,96766,Fall 2012,Kaua`i Community College,73,2012
1,96814,Fall 2012,Kapi`olani Community College,19,2012
2,96792,Fall 2012,Leeward Community College,504,2012
3,96822,Fall 2012,Honolulu Community College,12,2012
4,96821,Fall 2012,Kapi`olani Community College,43,2012


In [47]:
dataframe.shape

(73562, 5)
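
The row count is 73562 both before and after dropna(), which suggests that no remaining row had a missing value once HAWAIIAN_LEGACY was dropped. One way to check this kind of thing directly is to count missing values per column before dropping. The sketch below uses toy data; the values and the names toy, missing_per_column, and rows_dropped are assumptions for illustration.

```python
import pandas as pd

# Toy data with two incomplete rows (hypothetical values)
toy = pd.DataFrame({
    "SPRADDR_ZIP_PR": [96766, 96814, 96792],
    "ENROLLMENT": [73, None, 504],
    "YEAR": ["2012", "2012", None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value and report how many went away
rows_before = len(toy)
cleaned = toy.dropna()
rows_dropped = rows_before - len(cleaned)

print(missing_per_column)
print(rows_dropped)
```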

Storing the entire dataset back into a new CSV file called hawaii_enrollments.csv.

In [48]:
dataframe.to_csv('hawaii_enrollments.csv')
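
By default to_csv also writes the row index as an extra, unnamed first column. If that column is not wanted in the output file, passing index=False is one option. The sketch below shows the difference on toy data, writing to strings rather than files; the values and variable names are illustrative assumptions.

```python
import io
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"SPRADDR_ZIP_PR": [96766, 96814], "ENROLLMENT": [73, 19]})

# Default: the row index is written as an extra first column with an empty header
with_index = toy.to_csv()

# index=False writes only the data columns
without_index = toy.to_csv(index=False)

# Reading the default output back produces a column that pandas names "Unnamed: 0"
reloaded = pd.read_csv(io.StringIO(with_index))

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
print(list(reloaded.columns))
```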