Created by [SmirkyGraphs](https://smirkygraphs.github.io/). Code: [Github](https://github.com/SmirkyGraphs/Python-Notebooks). Source: [OLIS](http://www.olis.ri.gov/grants/srrc/data/a-z.php).
<hr>

# A to Z Databases Data Prep

A to Z Databases offers reference and marketing databases for job seekers, students and researchers.
This data is provided through Rhode Island's Office of Library & Information Services and ASK-RI.
It records the number of searches, pages viewed, and records viewed, printed, emailed and downloaded, aggregated by month.

To use this notebook, download each file from [here](http://www.olis.ri.gov/grants/srrc/data/a-z.php) and add them all to the `/data/raw` folder.
<hr>

## Collecting Data

Each A-Z database metrics file is named `(fyindex)-month, a-z fy(year).xlsx`.

The code below uses glob and pandas to merge the 40+ files into one, tagging each row with its month and fiscal year.
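
The merge cell reads the fiscal year, index and month from fixed character positions in the path. As a rough alternative, the same fields could be pulled out with a regular expression, which keeps working if the folder path changes. This is only a sketch: `parse_filename` and `FILENAME_RE` are hypothetical names, and the example file name `01-jul, a-z fy16.xlsx` is an assumption based on the naming pattern described above.

```python
import re
from pathlib import Path

# Assumed pattern: "<fy-index>-<month>, a-z fy<yy>.xlsx" (hypothetical example:
# "01-jul, a-z fy16.xlsx"). Adjust the expression if the real names differ.
FILENAME_RE = re.compile(
    r"^(?P<fy_index>\d+)-(?P<month>[a-z]{3})[a-z]*,.*fy(?P<yy>\d{2})\.xlsx$",
    re.IGNORECASE,
)

def parse_filename(path):
    """Return (fiscal year, fy-index, month abbreviation) parsed from a file path."""
    name = Path(path).name
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return "20" + match["yy"], int(match["fy_index"]), match["month"].lower()
```

Inside the loop, something like `fy, fy_index, month = parse_filename(file)` would then replace the three slices, and a misnamed file raises an error instead of silently producing a wrong month.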

In [1]:
import pandas as pd
import glob

In [2]:
# folder with files

files = glob.glob('./data/raw/*.xlsx')

In [3]:
# merge files into 1

frames = []

for file in files:
    
    # read files
    
    df = pd.read_excel(file)
    
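    # add fy, fy-index and month from the filename; the slices rely on fixed positions:
    # './data/raw/' is 11 characters, so file[11:13] is the first two characters of the
    # file name, file[14:17] the three after the separator, and file[-7:-5] the two
    # digits just before '.xlsx'
    
    df['fy'] = '20' + file[-7:-5]
    df['fy-index'] = file[11:13]
    df['month'] = file[14:17]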
    
    # add to frames list
    
    frames.append(df)
    
df = pd.concat(frames)

df.to_csv('./data/clean/a-z_database_merged.csv', index=False)

In [4]:
df.head()

| | Database | # of Searches | # of Pages Viewed | # of Records Viewed | # of Details Viewed | # of Records Printed | # of Records Emailed | # of Records Downloaded | fy | fy-index | month |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 Million Businesses & Executives-External | 1841 | 5264 | 113826 | 7176 | 4158 | 3129 | 54446 | 2016 | 1 | jul |
| 1 | 30 Million Businesses & Executives-Internal | 316 | 645 | 13660 | 117 | 151 | 4294 | 1910 | 2016 | 1 | jul |
| 2 | Universal (Search all databases)-External | 415 | 850 | 16245 | 0 | 0 | 1 | 0 | 2016 | 1 | jul |
| 3 | Universal (Search all databases)-Internal | 240 | 309 | 4938 | 0 | 0 | 9 | 51 | 2016 | 1 | jul |
| 4 | 220 Million Residents-External | 3202 | 7349 | 172753 | 148 | 16 | 447 | 63939 | 2016 | 1 | jul |


## Cleaning the Data

- split internal/external traffic
- combine same database names
- remove total row
- convert fy/month to a date
- normalize database names
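
For the first step, the raw `Database` label ends in `-External` or `-Internal`. A small helper like the one below could split it on the last hyphen only, so a database name that itself contains a hyphen would stay intact. This is a sketch, not what the cleaning cell does: `split_database` is a hypothetical name, and returning an empty method for labels without a hyphen is an assumption.

```python
def split_database(label):
    """Split 'Name-External' into ('Name', 'External') on the last hyphen only."""
    name, sep, method = label.rpartition('-')
    if not sep:
        # no hyphen in the label: keep it whole and leave the method empty
        return label, ''
    return name.strip(), method.strip()
```

For example, `split_database('220 Million Residents-External')` gives `('220 Million Residents', 'External')`.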

In [5]:
df = pd.read_csv('./data/clean/a-z_database_merged.csv')

In [6]:
# function to convert a (month, fiscal year) pair to a date string, e.g. 6/1/2017

def convert_date(x):
    
    month, fy = x
    
    # months 7-12 fall in the calendar year before the fiscal year
    
    if int(month) > 6:
        fy = int(fy) - 1
        
    return str(month) + '/1/' + str(fy)
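
The function above returns the date as text, so sorting puts `10/1/2015` before `2/1/2016` only by accident of string order. If real dates are wanted, the same fiscal year rule could be written with the standard library as below. This is a sketch under the same assumption the function makes (months 7 to 12 belong to the previous calendar year); `fiscal_to_date` is a hypothetical name.

```python
from datetime import date

def fiscal_to_date(month, fy):
    """Return the first day of `month` in the calendar year it falls in for fiscal year `fy`."""
    month, fy = int(month), int(fy)
    # months 7-12 open the fiscal year, so they belong to the previous calendar year
    year = fy - 1 if month > 6 else fy
    return date(year, month, 1)
```

Because the result is a `date` object, values sort chronologically: July of fiscal year 2016 (July 2015) comes before January of fiscal year 2016 (January 2016).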

In [7]:
# dict to convert month abbreviations to month numbers

month = {'jan':'1', 'feb':'2', 'mar':'3', 'apr':'4', 'may':'5', 'jun':'6',
         'jul':'7', 'aug':'8', 'sep':'9', 'oct':'10', 'nov':'11', 'dec':'12'}

In [8]:
# lists of databases with similar names

healthcare = ['1.1 Million Healthcare Professionals',
              '7.9 Million Healthcare Professionals',
              '12 Million Healthcare Professionals']

movers_owners = ['200,000 NEW Movers Added Weekly',
                 '350,000 NEW Movers/HomeOwners added weekly',
                 '200,000 Newmovers added weekly',
                 '50,000 NEW Homeowners Added Weekly',
                 '50.000 New Home Owners added weekly']

residents = ['220 Million Residents',
             '240 Million Residents']

business = ['30 Million Businesses & Executives',
            '2 Million NEW Businesses']

In [9]:
# cleaning the data

# convert abbreviated month to number format

df['month'] = df['month'].map(month)

# split database and connection method

df['method'] = df['Database'].apply(lambda x: x.split('-')[-1])
df['database'] = df['Database'].apply(lambda x: x.split('-')[0])

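# sum the metrics for rows that share a database, method and month
# numeric_only=True keeps the original text column 'Database' out of the sum

cols = ['database', 'method', 'fy', 'month', 'fy-index']
df = df.groupby(cols).sum(numeric_only=True).reset_index()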

# remove total

df = df[df['database'] != 'Total']

# convert month to date format

df['month'] = df[['month','fy']].apply(convert_date, axis=1)

# normalize names across months

df.loc[(df['database'].isin(healthcare)), 'database'] = 'Healthcare Professionals'
df.loc[(df['database'].isin(movers_owners)), 'database'] = 'Movers/Homeowners'
df.loc[(df['database'].isin(residents)), 'database'] = 'Residents'
df.loc[(df['database'].isin(business)), 'database'] = 'Businesses & Executives'
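
The four `df.loc` lines could also be driven by a single lookup table. The sketch below inverts the alias lists from the earlier cell into one dict; `ALIASES`, `NORMALIZED` and `normalize_name` are hypothetical names, and names not in any list are passed through unchanged.

```python
# Alias lists copied from the notebook, keyed by the normalized name.
ALIASES = {
    'Healthcare Professionals': ['1.1 Million Healthcare Professionals',
                                 '7.9 Million Healthcare Professionals',
                                 '12 Million Healthcare Professionals'],
    'Movers/Homeowners': ['200,000 NEW Movers Added Weekly',
                          '350,000 NEW Movers/HomeOwners added weekly',
                          '200,000 Newmovers added weekly',
                          '50,000 NEW Homeowners Added Weekly',
                          '50.000 New Home Owners added weekly'],
    'Residents': ['220 Million Residents',
                  '240 Million Residents'],
    'Businesses & Executives': ['30 Million Businesses & Executives',
                                '2 Million NEW Businesses'],
}

# Invert into a flat alias -> normalized name lookup.
NORMALIZED = {alias: name for name, aliases in ALIASES.items() for alias in aliases}

def normalize_name(database):
    """Return the normalized name, or the input unchanged if it has no alias entry."""
    return NORMALIZED.get(database, database)
```

It could be applied with `df['database'] = df['database'].map(normalize_name)`. Note that the groupby in the cell above runs before the names are normalized, so two source names that collapse to the same normalized name in the same month stay as separate rows. If one row per database, method and month is wanted, grouping again on the same columns after normalizing would merge them.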

In [10]:
df.head()

| | database | method | fy | month | fy-index | # of Searches | # of Pages Viewed | # of Records Viewed | # of Details Viewed | # of Records Printed | # of Records Emailed | # of Records Downloaded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Healthcare Professionals | External | 2016 | 1/1/2016 | 7 | 1 | 14 | 350 | 0 | 0 | 0 | 0 |
| 1 | Healthcare Professionals | External | 2016 | 10/1/2015 | 4 | 4 | 15 | 368 | 1 | 243 | 75 | 0 |
| 2 | Healthcare Professionals | External | 2016 | 11/1/2015 | 5 | 8 | 6 | 121 | 2 | 0 | 0 | 0 |
| 3 | Healthcare Professionals | External | 2016 | 12/1/2015 | 6 | 2 | 4 | 100 | 2 | 0 | 0 | 0 |
| 4 | Healthcare Professionals | External | 2016 | 2/1/2016 | 8 | 14 | 52 | 1300 | 4 | 0 | 0 | 0 |


In [11]:
# export final cleaned merged file

df.to_csv('./data/clean/a-z_database_clean.csv', index=False)