# Geometry Common Core Regents Exam EDA

This notebook contains the exploratory data analysis used to identify consistencies and patterns in the questions of the Geometry CC Regents exam. Specifically, we are trying to answer:

* Is the frequency of each cluster of questions consistent across exams?
* Does the frequency of each cluster agree with the EngageNY guidelines?
* Are any clusters being skipped? If so, why, and are those omissions consistent?
* Which clusters in a domain are more widely assessed?
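
As a rough illustration of how the first and third questions could be approached, the sketch below counts questions per cluster for each exam with `pd.crosstab`. The `toy` frame and its values are made up for the example; only the column names `Cluster` and `DateFixed` come from the Geometry table.

```python
import pandas as pd

# Toy rows reusing the Geometry table's column names (values are illustrative only)
toy = pd.DataFrame({
    "Cluster": ["G-CO.A", "G-CO.B", "G-CO.A", "G-SRT.C", "G-CO.A"],
    "DateFixed": ["Jun-15", "Jun-15", "Aug-15", "Aug-15", "Aug-15"],
})

# Rows are clusters, columns are exam administrations, cells are question counts
counts = pd.crosstab(toy["Cluster"], toy["DateFixed"])

# A zero count marks a cluster that did not appear on that exam
skipped = counts.eq(0)

print(counts)
print(skipped)
```

Reading across a row of `counts` shows whether a cluster's frequency is stable from exam to exam, and `skipped` flags the omissions.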

## Loading Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine, MetaData, Table, select

%matplotlib inline

## Loading the data from PostgreSQL

In [3]:
# create engine
engine = create_engine('postgresql+psycopg2://postgres:password@localhost:####/Regents Exams DataBase')

# connection
conn = engine.connect()

# acquire table by reflection (autoload_with alone is enough; autoload=True was removed in SQLAlchemy 2.0)
metadata = MetaData()
sql_table = Table('Geometry', metadata, autoload_with=engine)

# select every row of the table (SQLAlchemy 1.4+ takes the table directly, not a list)
results = conn.execute(select(sql_table)).fetchall()

df = pd.DataFrame(results, columns=sql_table.columns.keys())
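
The same load can likely be written more compactly with `pandas.read_sql_table`, which reflects the table and builds the DataFrame in one call. The sketch below is an assumption-laden stand-in, not the notebook's actual connection: it uses an in-memory SQLite engine (`demo_engine`) and a tiny seeded `Geometry` table so that it runs without a PostgreSQL server.

```python
import pandas as pd
from sqlalchemy import create_engine

# Stand-in engine: in-memory SQLite instead of the PostgreSQL instance (assumption)
demo_engine = create_engine("sqlite://")

# Seed a tiny 'Geometry' table so the example is self-contained
pd.DataFrame({
    "id": [1, 2],
    "Cluster": ["G-GMD.B", "G-CO.B"],
    "Qnumber": ["1", "2"],
}).to_sql("Geometry", demo_engine, index=False)

# read_sql_table reflects the table and returns its rows as a DataFrame
demo_df = pd.read_sql_table("Geometry", demo_engine)
print(demo_df.shape)
```

With the real database, passing the notebook's `engine` in place of `demo_engine` should give the same frame as the manual `select` above.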

## Preliminary Checks of Data

In [4]:
# check data
df.head()

| | id | ClusterTitle | Cluster | Regents Date | Type | DateFixed | Qnumber |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Visualize relationships between two-dimensiona... | G-GMD.B | 2015-06-01 | MC | Jun-15 | 1 |
| 1 | 2 | Understand congruence in terms of rigid motions | G-CO.B | 2015-06-01 | MC | Jun-15 | 2 |
| 2 | 3 | Use coordinates to prove simple geometric theo... | G-GPE.B | 2015-06-01 | MC | Jun-15 | 3 |
| 3 | 4 | Experiment with transformations in the plane | G-CO.A | 2015-06-01 | MC | Jun-15 | 4 |
| 4 | 5 | Define trigonometric ratios and solve problems... | G-SRT.C | 2015-06-01 | MC | Jun-15 | 5 |


In [5]:
# Check dataframe info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 7 columns):
id              358 non-null int64
ClusterTitle    358 non-null object
Cluster         358 non-null object
Regents Date    358 non-null object
Type            358 non-null object
DateFixed       358 non-null object
Qnumber         358 non-null object
dtypes: int64(1), object(6)
memory usage: 19.7+ KB


The id column was only created for the SQL table and is not needed for the analysis, so we drop it now to reduce memory usage.

In [10]:
# drop id column
df.drop(['id'], axis='columns', inplace=True)

In [11]:
# confirm the column was dropped and check memory usage
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 358 entries, 0 to 357
Data columns (total 6 columns):
ClusterTitle    358 non-null object
Cluster         358 non-null object
Regents Date    358 non-null object
Type            358 non-null object
DateFixed       358 non-null object
Qnumber         358 non-null object
dtypes: object(6)
memory usage: 16.9+ KB


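About 3 KB less memory is used (19.7 KB down to 16.9 KB). Let's check the memory usage of each column in bytes. Note that `memory_usage()` without `deep=True` counts only the 8-byte pointer stored for each object value (358 × 8 = 2864), not the strings themselves, which is also why `info()` reports the total with a trailing `+`.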

In [13]:
# check dataframe memory usage by column in bytes
df.memory_usage()

Index             80
ClusterTitle    2864
Cluster         2864
Regents Date    2864
Type            2864
DateFixed       2864
Qnumber         2864
dtype: int64
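
Every object column above reports the same 2864 bytes because the default call is shallow. A minimal sketch of the difference, using a small stand-in frame (`sample`) rather than the real data, with the column forced to `object` dtype to mirror the notebook's columns:

```python
import pandas as pd

# Small stand-in frame with one object (string) column
sample = pd.DataFrame(
    {"Cluster": ["G-GMD.B", "G-CO.B", "G-GPE.B", "G-CO.A"]},
    dtype=object,
)

# Shallow: counts only the pointer stored for each value
shallow = sample.memory_usage()

# Deep: also counts the Python string objects the pointers refer to
deep = sample.memory_usage(deep=True)

print(shallow["Cluster"], deep["Cluster"])
```

On the real frame, `df.memory_usage(deep=True)` would be the call to use when comparing string columns against their categorical versions.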

In [20]:
# Get the Python type of the first value in each column
for column in df.columns:
    print(type(df[column].iloc[0]))

<class 'str'>
<class 'str'>
<class 'datetime.date'>
<class 'str'>
<class 'str'>
<class 'str'>
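
Since `Regents Date` holds `datetime.date` objects and `Qnumber` holds strings, one possible next step is to convert them to native pandas dtypes. This is only a sketch on a stand-in frame (`typed`), and it assumes every `Qnumber` value is a numeric string, which the output above does not confirm.

```python
import datetime
import pandas as pd

# Stand-in frame mirroring the two columns of interest (values are illustrative)
typed = pd.DataFrame({
    "Regents Date": [datetime.date(2015, 6, 1), datetime.date(2015, 6, 1)],
    "Qnumber": ["1", "2"],
})

# datetime.date objects become a datetime64 column
typed["Regents Date"] = pd.to_datetime(typed["Regents Date"])

# Numeric strings become integers (raises if a value is not numeric)
typed["Qnumber"] = pd.to_numeric(typed["Qnumber"])

print(typed.dtypes)
```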


In [25]:
# Note (7/19/18): will continue tomorrow.
# Check whether converting a str column to a categorical column reduces memory further
df['Cluster'] = df['Cluster'].astype('category')

In [26]:
# it does: Cluster drops from 2864 to 1110 bytes (see below)
df.memory_usage()

Index             80
ClusterTitle    2864
Cluster         1110
Regents Date    2864
Type            2864
DateFixed       2864
Qnumber         2864
dtype: int64
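
The same conversion could plausibly be applied to any column with few distinct values. The sketch below is one way to do that on a stand-in frame (`demo`); the 0.5 uniqueness threshold and the `"CR"` value are arbitrary choices for illustration, not taken from the data.

```python
import pandas as pd

# Stand-in frame with repeated string values (illustrative, not the real data)
demo = pd.DataFrame(
    {
        "Cluster": ["G-CO.A", "G-CO.B", "G-CO.A", "G-SRT.C"] * 50,
        "Type": ["MC", "MC", "CR", "MC"] * 50,
    },
    dtype=object,
)

before = demo.memory_usage(deep=True).sum()

# Convert columns whose share of distinct values is below an arbitrary threshold
for col in demo.columns:
    if demo[col].nunique() / len(demo) < 0.5:
        demo[col] = demo[col].astype("category")

after = demo.memory_usage(deep=True).sum()
print(before, after)
```

Columns such as `ClusterTitle`, `Type`, and `DateFixed` repeat heavily across 358 rows, so they are natural candidates, but the actual savings would need to be measured on the real frame.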