### Familiarize yourself with your Data
> - Python Libs
> - Reading Data into Python
> - Contents of data set(s)
> - First 'n' data records
> - Column & Row views 
> - Converting variables (strings, numbers, etc.)
> - Data Dimensions & Indexes
> - How to handle MISSING values
> - Drop Rows and Columns

### 1. Python Libs

SAS and SQL ship with built-in procedures and functions, such as the PROCs in SAS or, depending on your SQL flavor, window functions like `OVER (PARTITION BY ...)`. In Python you either write your own function or import a library that already provides one. It's good practice to skim the library documentation and search online when you have questions; most libraries are well documented, with source code and examples. Unlike SAS and SQL, Python (and R) require you to install and import libs, which play roughly the role of global or system macros and functions.
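To make the "import a library instead of a PROC" idea concrete, here is a minimal sketch of how two common SQL window functions might look in Pandas. The `claims` table and its columns are made-up toy data, not part of the readmissions file used later.

```python
import pandas as pd

# Hypothetical toy data (assumption: not from the notebook's data set)
claims = pd.DataFrame({
    "mem_id": [1, 1, 2, 2, 2],
    "paid": [100.0, 50.0, 20.0, 30.0, 10.0],
})

# Roughly: ROW_NUMBER() OVER (PARTITION BY mem_id ORDER BY paid DESC)
claims["row_num"] = (
    claims.sort_values("paid", ascending=False)
          .groupby("mem_id")
          .cumcount() + 1
)

# Roughly: SUM(paid) OVER (PARTITION BY mem_id)
claims["member_total"] = claims.groupby("mem_id")["paid"].transform("sum")

print(claims)
```

The results are aligned back to the original rows by index, so the sort inside the expression does not reorder `claims` itself.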

##### Some essential Python Libs: https://medium.com/activewizards-machine-learning-company/top-15-python-libraries-for-data-science-in-in-2017-ab61b4f9b4a7

In [4]:
from pandasql import sqldf #!pip install -U pandasql
pysqldf = lambda q: sqldf(q, globals())
from collections import Counter
import pandas as pd #Pandas Lib
import numpy as np  #NumPy Lib
import matplotlib.pyplot as plt

### 2. Reading Data into Python

Though there are multiple data mediums and repositories, I only plan to cover one simple data source such as CSV. However, each lib has documentation on how to connect to RDBMS, Hadoop, or other environments.  

- SAS:
Uses 'proc import' with various options to accomplish the exact same things
~~~
proc import datafile="file.csv"
        out=something1
        dbms=csv
        replace;    
        getnames=no;
run;
~~~

- Hive & other SQL flavors: 
~~~~
drop table something;
CREATE TABLE something (TRANS_CONTROL_NUM string ,...) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 
LOAD DATA INPATH "path" INTO TABLE something;
~~~~

*Working with RDBMS*:  
- https://www.dataquest.io/blog/python-pandas-databases/
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql.html      
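As a small illustration of the database route, the sketch below uses Python's built-in `sqlite3` module with an in-memory database, so nothing needs to be installed or configured. The `counts` table and its values are invented for the example; a real RDBMS would need its own connection object and credentials.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real RDBMS (assumption)
conn = sqlite3.connect(":memory:")

# Write a toy table, then query it back into a data frame
toy = pd.DataFrame({"state": ["AL", "AK"], "n": [498, 48]})
toy.to_sql("counts", conn, index=False)

out = pd.read_sql("SELECT state, n FROM counts WHERE n > 100", conn)
conn.close()

print(out)
```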

In [None]:
#a) Create a CSV from an existing Data Frame ('save as' in a working directory)
path = '/data01/readmits.csv' #you can simply put a working path in place of  '/data01/readmits.csv'
df.to_csv(path)

#b) Read a CSV 
df = pd.read_csv(path)

#c) Read a CSV with no column names/headers
df = pd.read_csv(path , header=None)

#d) Read a CSV whilst defining column names
df = pd.read_csv(path, names=['mem_id', 'admit_dt', 'discharge_dt', 'Age', 'ICD1_DX', 'dis_disp'])

#e) Read while specifying "." for missing (there is a whole section on missing)
df = pd.read_csv(path, na_values=['.']) #we usually use "." in SAS for numeric
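The options above can be tried without any file on disk. This sketch feeds a small, made-up CSV string through `io.StringIO`; the column names and values are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical three-line CSV: a header row plus two records
raw = "mem_id,Age,dis_disp\n101,67,home\n102,.,snf\n"

# Header row used as column names; "." read as missing
with_header = pd.read_csv(io.StringIO(raw), na_values=["."])

# No header: the first line is treated as data, columns are numbered 0..2
no_header = pd.read_csv(io.StringIO(raw), header=None)

# Replace the existing header row with our own names
named = pd.read_csv(io.StringIO(raw), names=["id", "age", "disp"], header=0)

print(with_header)
print(no_header.shape)
print(list(named.columns))
```

Note that passing `names=` without `header=0` on a file that already has a header row would keep that header as a data record.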

Let's read in a data set of our own from the many available on the U.S. government's open data site:

*data owned and sourced from: https://www.data.gov/*

In [83]:
url = "https://data.medicare.gov/api/views/9n3s-kdb3/rows.csv?accessType=DOWNLOAD" 
life = pd.read_csv(url)

### 3. Contents of data set(s)

If you recollect, in SAS you can view contents of data by simply using **'proc contents'** with options such as *'short'*. In Hive QL, you can use **'describe table'**. 

In [18]:
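#curious how long this function might take (SAS and SQL provide this in the log)
#note: a bare %time times an empty statement (hence "Wall time: 0 ns" below);
#      to time the call itself, put it on the same line: %time life.info()
%time 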

#Contents
life.info()

Wall time: 0 ns
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19878 entries, 0 to 19877
Data columns (total 12 columns):
Hospital Name                 19878 non-null object
Provider Number               19878 non-null int64
State                         19878 non-null object
Measure Name                  19878 non-null object
Number of Discharges          19878 non-null object
Footnote                      5563 non-null float64
Excess Readmission Ratio      19878 non-null object
Predicted Readmission Rate    19878 non-null object
Expected Readmission Rate     19878 non-null object
Number of Readmissions        19878 non-null object
Start Date                    19878 non-null object
End Date                      19878 non-null object
dtypes: float64(1), int64(1), object(10)
memory usage: 1.8+ MB
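If you would rather have the "proc contents" view as a data frame you can sort or filter, one possible approach is to assemble it yourself. This is only a sketch on a made-up two-column frame; `toy` and `contents` are names chosen for the example.

```python
import pandas as pd

# Hypothetical toy frame with a few missing values
toy = pd.DataFrame({
    "State": ["AL", "AK", None],
    "Footnote": [1.0, None, None],
})

# One row per column: its dtype, non-missing count, and missing count
contents = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.count(),
    "n_missing": toy.isna().sum(),
})

print(contents)
```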


### 4. First 'n' data records

In **SAS** we can accomplish this in many ways! I have illustrated three:
- data see; set source (obs=5); run;
- proc sql; select * from source (obs=5); quit; 
- you can also use monotonic to select n rows which relies on an index

In **SQL** (flavor dependent): 
- Hive: select * from source limit 10;

In [22]:
#print first 2 records
life.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15


In [23]:
#Select specific columns to look at and the first 2 records
life[['Measure Name', 'Hospital Name']].head(2)

Unnamed: 0,Measure Name,Hospital Name
0,READM-30-AMI-HRRP,SOUTHEAST ALABAMA MEDICAL CENTER
1,READM-30-CABG-HRRP,SOUTHEAST ALABAMA MEDICAL CENTER


In [25]:
#View the first two records in a data set or matrix using an index approach
life[:2]

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15
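A few equivalent ways to get the first 'n' records, sketched on a tiny made-up CSV (the data here is an assumption, not the readmissions file). The `nrows` option is the closest match to `obs=` or `limit`, since it stops reading early instead of loading everything first.

```python
import io
import pandas as pd

# Hypothetical CSV content
raw = "State,n\nAL,498\nAK,48\nAR,264\n"
toy = pd.read_csv(io.StringIO(raw))

first_two_head = toy.head(2)        # first 2 rows
first_two_iloc = toy.iloc[:2]       # same rows, by position
first_two_read = pd.read_csv(io.StringIO(raw), nrows=2)  # read only 2 rows
last_row = toy.tail(1)              # last row, for comparison

print(first_two_head)
```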


**Bonus:** *I thought it would be cool if we could:*
- Find a SQL lib
- Use the SQL lib to get the first 2 records
- Assign it to a data frame

In [30]:
qry ="""SELECT * FROM life limit 2;"""  
df = pysqldf(qry) 
df

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15


### 5. Column & Row views 

You may want to see the contents of a specific column or row. I could see this being used for tables that feed into a process which iterates through each row. In SAS you could use CALL EXECUTE to iterate through a data set:
~~~~
data _NULL_;
set call_rx_feed;
call execute('%rx_pull('||_log_host||','||db_schema||','||db_path||')');	
run;
~~~~



In [32]:
df = life
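One thing worth knowing about `df = life`: it does not copy the data, it only gives the same data frame a second name. The sketch below shows the difference on a made-up frame; `life_toy` is a stand-in name, not the notebook's `life`.

```python
import pandas as pd

life_toy = pd.DataFrame({"State": ["AL", "AK"]})

# Plain assignment: both names point at the same object
alias = life_toy
alias["flag"] = 1          # this also changes life_toy

# .copy() gives an independent data frame
independent = life_toy.copy()
independent["other"] = 2   # life_toy is not affected

print(list(life_toy.columns))
```

If you want to experiment without touching the original, `.copy()` is the safer choice.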

In [34]:
#View a column
df['State'].head(2)

0    AL
1    AL
Name: State, dtype: object

In [38]:
#SAS equivalent: proc freq data=df; tables state / nocol nopercent nocum; run;
x = df['State'].value_counts().sort_index()
x.head()

AK      48
AL     498
AR     264
AZ     384
CA    1776
Name: State, dtype: int64
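If you also want the percent column that proc freq prints by default, `value_counts` has a `normalize` option. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical toy column
states = pd.Series(["AL", "AL", "AK", "TX"], name="State")

counts = states.value_counts().sort_index()
percents = states.value_counts(normalize=True).sort_index() * 100

print(counts)
print(percents)
```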

In [41]:
#View 2+ columns
df[['State', 'Provider Number']].head()

Unnamed: 0,State,Provider Number
0,AL,10001
1,AL,10001
2,AL,10001
3,AL,10001
4,AL,10001


In [43]:
#View Row... not the best way but a demonstration
#(.ix was removed from Pandas; use .loc for labels or .iloc for positions)
df.loc[1]

Hospital Name                 SOUTHEAST ALABAMA MEDICAL CENTER
Provider Number                                          10001
State                                                       AL
Measure Name                                READM-30-CABG-HRRP
Number of Discharges                                       273
Footnote                                                   NaN
Excess Readmission Ratio                                1.0618
Predicted Readmission Rate                             13.8887
Expected Readmission Rate                              13.0809
Number of Readmissions                                      40
Start Date                                           01-JUL-12
End Date                                             30-JUN-15
Name: 1, dtype: object
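To mirror the CALL EXECUTE pattern from the SAS snippet, you can loop over the rows of a control table and build one call per row. This is a sketch only: the `feed` table, its columns, and the `rx_pull(...)` string are hypothetical stand-ins for the SAS macro call.

```python
import pandas as pd

# Hypothetical control table; the index labels are deliberately not 0, 1
feed = pd.DataFrame(
    {"host": ["srv1", "srv2"], "schema": ["claims", "rx"]},
    index=[10, 20],
)

by_label = feed.loc[10]     # row whose index label is 10
by_position = feed.iloc[0]  # first row, whatever its label

# One generated call per row, similar in spirit to CALL EXECUTE
calls = [
    f"rx_pull({row.host},{row.schema})"
    for row in feed.itertuples(index=False)
]
print(calls)
```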

### 6. Converting variables (strings, numbers, etc.)

**SAS**
>Convert variable values from character to numeric or from numeric to character:

* Character to numeric
~~~~
data new;
   char_var = '12345678';
   numeric_var = input(char_var, 8.);
run;
~~~~
* Numeric to character
~~~~
new_variable = put(numeric_variable, format.);
~~~~
* Character to Date
~~~~
input(char_var,date9.);
~~~~

**HIVE**
* STRING to a BIGINT
~~~~
SELECT CAST('00321' AS BIGINT) FROM table;
~~~~
* BIGINT to a STRING: see https://stackoverflow.com/questions/32576187/hive-converting-from-double-to-string-not-in-scientific
* Character to Date
~~~~
from_unixtime(unix_timestamp(a.first_date_of_svc, 'yyyyMMdd'))
~~~~

In [45]:
df = life
cols_to_convert = ['Number of Discharges', 'Expected Readmission Rate']

In [46]:
#TO NUMERIC
#  You'll have to iterate over each column in the newly defined data frame ('df') 
#  where columns that are converted come from cols_to_convert
for col in cols_to_convert:
    df[col] = pd.to_numeric(df[col], errors='coerce')
#You can do ALL columns at once: df.apply(pd.to_numeric, errors='coerce')
#  (without errors='coerce' this fails on text columns; with it, text becomes NaN)
## Did it work?
df.dtypes

Hospital Name                  object
Provider Number                 int64
State                          object
Measure Name                   object
Number of Discharges          float64
Footnote                      float64
Excess Readmission Ratio       object
Predicted Readmission Rate     object
Expected Readmission Rate     float64
Number of Readmissions         object
Start Date                     object
End Date                       object
dtype: object

In [47]:
#TO CHARACTER
# let's reverse this
for col in cols_to_convert:
    df[col] =  df[col].astype(str) 
## Did it work?
df.dtypes

Hospital Name                  object
Provider Number                 int64
State                          object
Measure Name                   object
Number of Discharges           object
Footnote                      float64
Excess Readmission Ratio       object
Predicted Readmission Rate     object
Expected Readmission Rate      object
Number of Readmissions         object
Start Date                     object
End Date                       object
dtype: object
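Here is a small sketch of what `errors='coerce'` does to values that are not numbers, using strings like the ones that appear in this data set. The three-value Series is made up for the example.

```python
import pandas as pd

# Hypothetical sample resembling the 'Number of Discharges' column
raw = pd.Series(["781", "273", "Not Available"])

# Text that cannot be parsed becomes NaN instead of raising an error
nums = pd.to_numeric(raw, errors="coerce")

# Converting back to text does not restore the original strings
back = nums.astype(str)

print(nums)
print(back.iloc[0])
```

Because the numeric column is float64, the round trip turns "781" into "781.0", so converting to numeric and back is not a lossless operation.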

In [None]:
df = life
cols_to_convert = ['Start Date', 'End Date']

In [51]:
#TO DATE
#  You'll have to iterate over each column in the newly defined data frame ('df') 
#  where columns that are converted come from cols_to_convert
for col in cols_to_convert:
    df[col] = pd.to_datetime(df[col], errors='coerce')   
## Did it work?
df.dtypes    

Hospital Name                         object
Provider Number                        int64
State                                 object
Measure Name                          object
Number of Discharges                  object
Footnote                             float64
Excess Readmission Ratio              object
Predicted Readmission Rate            object
Expected Readmission Rate             object
Number of Readmissions                object
Start Date                    datetime64[ns]
End Date                      datetime64[ns]
dtype: object
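When you know the layout of the date strings, you can pass it explicitly, much like choosing an informat such as `date9.` in SAS. The sketch below assumes the `DD-MON-YY` layout seen in the Start Date and End Date columns; the two-row frame is made up.

```python
import pandas as pd

# Hypothetical sample in the same layout as 'Start Date' / 'End Date'
toy = pd.DataFrame({
    "Start Date": ["01-JUL-12"],
    "End Date": ["30-JUN-15"],
})

for col in ["Start Date", "End Date"]:
    # %d = day, %b = abbreviated month name, %y = two-digit year
    toy[col] = pd.to_datetime(toy[col], format="%d-%b-%y", errors="coerce")

# Once converted, date arithmetic works directly
toy["days"] = (toy["End Date"] - toy["Start Date"]).dt.days
print(toy)
```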

### 7. Data Dimensions & Indexes



In [52]:
# Number of Records
life.shape[0] #Number of Rows
life.shape[1] #Number of Columns

12

In [54]:
# Count the non-missing values in each column by STATE (these are rows, not unique hospitals)
df=life.groupby('State').count()
df.head(2)

State,Hospital Name,Provider Number,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
AK,48,48,48,48,10,48,48,48,48,48,48
AL,498,498,498,498,164,498,498,498,498,498,498


In [55]:
# which state (index label) has the most 'Number of Discharges' records (a count of rows, not a sum of discharges)
df['Number of Discharges'].idxmax()

'TX'
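Since `count()` returns the number of non-missing rows per group, it can help to see it next to the alternatives. This sketch uses a made-up five-row frame to contrast rows per state, unique hospitals per state, and total discharges per state.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "State": ["AL", "AL", "AL", "TX", "TX"],
    "Hospital Name": ["A", "A", "B", "C", "C"],
    "Number of Discharges": [100, 50, 30, 500, None],
})

rows_per_state = toy.groupby("State").size()
hospitals_per_state = toy.groupby("State")["Hospital Name"].nunique()
discharges_per_state = toy.groupby("State")["Number of Discharges"].sum()

# State with the most recorded rows vs. state with the most discharges
most_rows = toy.groupby("State")["Number of Discharges"].count().idxmax()
most_discharges = discharges_per_state.idxmax()

print(most_rows, most_discharges)
```

In this toy data the two answers differ, which is why it matters which aggregate sits in front of `idxmax()`.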

In [56]:
# List Unique Values
life['State'].unique() 
# Same as above
x =life['State'].unique() 
x

array(['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'FL', 'DE', 'DC', 'GA',
       'ID', 'HI', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA',
       'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY',
       'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY'], dtype=object)

In [60]:
#What is the data set indexed on?
life.index

RangeIndex(start=0, stop=19878, step=1)

In [61]:
#What index values do we currently have for 'df'
df.index.values

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY'], dtype=object)

### 8. How to handle MISSING values

Python:
* isnull()  > generates a boolean mask to indicate missing values
* notnull() > opposite of isnull()
* dropna()  > returns a filtered version of the data
* fillna()  > returns a copy of data with missing values filled or imputed


## **'None'** 
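is Python's built-in null object. Inside NumPy/Pandas it can be stored only in arrays/vectors with data type='object', which are slower and cannot be aggregated (e.g. summed) while a None is present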

In [65]:
none_dset = np.array([1, None, 3, None, 'B'])
none_dset

array([1, None, 3, None, 'B'], dtype=object)

## **'NaN'** 
(short for Not a Number) is a special floating-point value defined by the IEEE 754 standard, so it is recognized by any system that uses IEEE floating point

In [67]:
nan_dset = np.array([1, np.nan, np.nan, 1] * 2) 
nan_dset

array([  1.,  nan,  nan,   1.,   1.,  nan,  nan,   1.])

## **'NaN'  & 'None'**
can both mark missing data in Pandas, which converts None to NaN when the data is numeric

In [68]:
nan_none = pd.Series([1, np.nan, 2, None])
nan_none

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
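The practical difference between the two shows up as soon as you aggregate. A minimal sketch, using small made-up arrays:

```python
import numpy as np
import pandas as pd

# None forces dtype=object, and arithmetic on it fails
obj_arr = np.array([1, None, 3])
try:
    obj_arr.sum()
    none_sum_failed = False
except TypeError:
    none_sum_failed = True

# NaN keeps a float dtype, but it propagates through plain aggregations
float_arr = np.array([1, np.nan, 3])
plain_sum = float_arr.sum()       # nan
nan_safe_sum = np.nansum(float_arr)  # ignores NaN

# Pandas skips missing values by default (skipna=True)
series_sum = pd.Series([1, np.nan, 3, None]).sum()

print(none_sum_failed, plain_sum, nan_safe_sum, series_sum)
```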

## **isnull()**
> generates a boolean mask to indicate missing values

In [95]:
#Want to see how it's done in the background?
df_null = life.isnull()
df_null.head(5)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,False,False,False,False,False,True,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,True,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,False,False


In [112]:
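#isnull() works on every dtype; the object columns show 0 here because their gaps
#are stored as text (e.g. 'Not Available'), which Pandas does not treat as missing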
life.dtypes
df = life

#missing values in each column
df.isnull().sum(axis=0)
#OR
df.isnull().sum()

Hospital Name                     0
Provider Number                   0
State                             0
Measure Name                      0
Number of Discharges              0
Footnote                      14315
Excess Readmission Ratio          0
Predicted Readmission Rate        0
Expected Readmission Rate         0
Number of Readmissions            0
Start Date                        0
End Date                          0
dtype: int64

**Find the number of missing values**
> Example in SAS
~~~
proc freq order=freq data=df;
    tables State / nocol nopercent nocum missing;
run;
~~~

In [120]:
#Frequency specific columns
df = life
df['State'].value_counts(dropna=False).head(5)

TX    1878
CA    1776
FL    1014
NY     918
PA     900
Name: State, dtype: int64

In [119]:
#Frequency all columns
for x in life.columns:
    print (x, end="~~Number of Missing Values >")
    print (sum(df[x].isnull()))

Hospital Name~~Number of Missing Values >0
Provider Number~~Number of Missing Values >0
State~~Number of Missing Values >0
Measure Name~~Number of Missing Values >0
Number of Discharges~~Number of Missing Values >0
Footnote~~Number of Missing Values >14315
Excess Readmission Ratio~~Number of Missing Values >0
Predicted Readmission Rate~~Number of Missing Values >0
Expected Readmission Rate~~Number of Missing Values >0
Number of Readmissions~~Number of Missing Values >0
Start Date~~Number of Missing Values >0
End Date~~Number of Missing Values >0
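The zeros above can be misleading when missing values are stored as text. One possible way to handle that, sketched on a made-up frame, is to replace the placeholder strings with NaN before counting. The two placeholder strings are ones that appear in this data set's output; whether they are the only ones is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample resembling two of the object columns
toy = pd.DataFrame({
    "Number of Discharges": ["781", "Not Available", "273"],
    "Number of Readmissions": ["119", "Too Few to Report", "40"],
})

before = toy.isnull().sum()

# Treat the placeholder strings as missing (assumed list of placeholders)
placeholders = ["Not Available", "Too Few to Report"]
cleaned = toy.replace(placeholders, np.nan)

after = cleaned.isnull().sum()
print(before)
print(after)
```

The same list could instead be passed to `pd.read_csv(..., na_values=placeholders)` so the values arrive as NaN from the start.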


## dropna()
> which removes NA values and returns a filtered version of the data

In [122]:
#Drop rows that contain at least one missing value
df_no_missing = life.dropna()
#Drop rows only where every cell in the row is NA
df_cleaned = life.dropna(how='all')
df_cleaned.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15
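To see how the `dropna()` options differ, here is a sketch on a three-row made-up frame, including the `subset` option for when only certain columns matter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 2 is entirely missing
toy = pd.DataFrame({
    "State": ["AL", "AK", None],
    "Footnote": [np.nan, 1.0, np.nan],
})

any_missing = toy.dropna()                  # keeps only fully populated rows
all_missing = toy.dropna(how="all")         # drops rows where everything is NA
by_state = toy.dropna(subset=["State"])     # only checks the State column

print(len(any_missing), len(all_missing), len(by_state))
```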


## fillna()  
> returns a copy of data with missing values filled or imputed 

In [125]:
#Create a new column that is entirely missing, or set it to some other value (a number or a string)
df=life

In [127]:
#with some value
df['test_c'] = "Over"
df.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,test_c
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,Over
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15,Over


In [128]:
#with NaN
df=life
df['test_c'] = np.nan
df.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,test_c
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15,


In [129]:
#Fill in missing data with zeros
df = df.fillna(0)
df.head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,test_c
0,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-AMI-HRRP,781,0.0,0.9837,15.358,15.6121,119,01-JUL-12,30-JUN-15,0.0
1,SOUTHEAST ALABAMA MEDICAL CENTER,10001,AL,READM-30-CABG-HRRP,273,0.0,1.0618,13.8887,13.0809,40,01-JUL-12,30-JUN-15,0.0
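Filling everything with 0 is not always what you want. `fillna()` also accepts a dictionary, so each column can get its own fill value. A minimal sketch on made-up data, using the column mean as a simple imputation:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Footnote": [np.nan, 1.0, 3.0],
    "State": ["AL", None, "AK"],
})

# Different fill value per column; the original frame is left unchanged
filled = toy.fillna({
    "Footnote": toy["Footnote"].mean(),
    "State": "Unknown",
})

print(filled)
```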


### 9. Drop Rows and Columns

- SAS: In SAS we can accomplish this with the **DROP** or **KEEP** statements
- SQL: You can simply use a **SELECT** statement

In [131]:
#Drop a record/row
df=life
df.shape[0] #Number of Rows 19878
df.drop([1, 2]).shape[0] #Number of Rows 19878-2 = 19876

19876

In [132]:
#Drop a variable (column) | axis=1 tells Python I am referring to a column
df.shape[1] #Current Number of Columns = 13
df.drop('State', axis=1).shape[1] #New Number of Columns = 12

12

In [134]:
#Drop records that contain a value like 'AL'
df[df['State'] != 'AL'].head(2)

Unnamed: 0,Hospital Name,Provider Number,State,Measure Name,Number of Discharges,Footnote,Excess Readmission Ratio,Predicted Readmission Rate,Expected Readmission Rate,Number of Readmissions,Start Date,End Date,test_c
498,PROVIDENCE ALASKA MEDICAL CENTER,20001,AK,READM-30-AMI-HRRP,385,,0.8867,12.9298,14.5825,43,01-JUL-12,30-JUN-15,
499,PROVIDENCE ALASKA MEDICAL CENTER,20001,AK,READM-30-CABG-HRRP,Not Available,,0.9064,12.044,13.287,Too Few to Report,01-JUL-12,30-JUN-15,
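Note that `drop()` and boolean filters return a new data frame rather than changing the original, which is why the cells above only inspect the result. A short sketch on made-up data, also showing `columns=` as an alternative to `axis=1` and `isin()` for excluding several values:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"State": ["AL", "AK", "TX"], "n": [1, 2, 3]})

fewer_rows = toy.drop([0])                 # drop the row labelled 0
fewer_cols = toy.drop(columns=["n"])       # same effect as drop('n', axis=1)
not_al_ak = toy[~toy["State"].isin(["AL", "AK"])]  # exclude several values

print(toy.shape, fewer_rows.shape, fewer_cols.shape)
```

To keep the result, assign it back, for example `toy = toy.drop(columns=["n"])`.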
