# Smiles United - Data Cleaning - Pre

## Overview
 
* ALWAYS RESTART AND CLEAR OUTPUT BEFORE PUSHING. YES, EVEN THOUGH THE DATA FILE AND THIS NOTEBOOK ARE IGNORED.

## Imports

In [None]:
import pandas as pd
import numpy as np
import scipy.stats as stats

import seaborn as sns
sns.set_style('darkgrid', {'axes.facecolor': '0.9', "grid.color": ".6", "grid.linestyle": ":"})
sns.set_context("talk")

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## Load data

In [None]:
df = pd.read_csv("../data/pre_intervention/Smiles United Survey_August 22, 2023_14.25.csv",header=1)
df

## Rename columns

In [None]:
df.columns

In [None]:
df.columns = ['Start Date', 
              'End Date', 
              'Response Type', 
              'IP Address', 
              'Progress',
              'Duration (seconds)', 
              'Finished', 
              'Recorded Date', 
              'Response ID',
              'Recipient Last Name', 
              'Recipient First Name', 
              'Recipient Email',
              'External Data Reference', 
              'Location Latitude', 
              'Location Longitude',
              'Distribution Channel', 
              'User Language',
              'ChosenID',
              'What is your primary language?',
              'What is your primary language? - Other',
              'Are you a:',
              'Are you a: - Other',
              'Which of the following best describes your Race/Ethnicity?',
              'Do you identify as:',
              'Which of the following best describes the area you live in?',
              'Before today, I have received training on how to provide direct oral health care for individuals with special health care needs',
              'Before today, I have received training on how to provide direct oral health care for individuals with special health care needs - Yes. If you answered YES to a previous training, please describe here:',
              'Fluoridated products, such as fluoridated toothpaste and fluoridated water, can help improve the oral health of residents.',
              'It is normal for healthy gums to bleed when brushing teeth.',
              'Dry mouth can have a negative effect on overall oral health.',
              'Snacking throughout the day can have a negative impact on oral health.',
              'I believe I have previously received adequate training to help provide the best oral care possible to residents under my care.',
              'I believe residents under my care have oral health care needs which require further training to adequately understand and help manage.',
              'I believe I have effective techniques which I use to brush the teeth of residents under my care.',
              'I feel comfortable assisting residents in the safe use of fluoridated dental products (such as fluoridated toothpaste).',
              'I am able to confidently recognize non-verbal signs of pain in residents under my care.',
              'I would be interested in receiving additional training to help maintain the oral health of residents under my care.',
              'I feel confident that I have the knowledge to identify when residents under my care experience oral pain.',
              'Approximately, what percentage of residents under your care require assistance brushing or flossing their teeth?',
              'Approximately, what percentage of residents under your care experience bleeding when brushing their teeth?',
              'Approximately, what percentage of residents under your care experience bleeding when flossing their teeth?',
              'Approximately, what percentage of residents under your care express that they experience pain when brushing their teeth?',
              'Approximately, what percentage of residents under your care experience pain when flossing their teeth?',
              'Approximately, what percentage of residents under your care express that they experience dental pain throughout the day when they are not brushing or flossing their teeth?',
              'Approximately, how often do residents under your care go to the dentist?',
              'Approximately, how often do residents under your care go to the dentist? - Other',
              'On average, how often do most residents under your care brush their teeth?',
              'On average, how often do most residents under your care floss their teeth?',
              'How often should residents brush their teeth each day?',
              'How often should residents floss their teeth each day?',
              'Approximately, how often do residents under your care have snacks throughout the day between brushing and flossing their teeth?',
              'What is the biggest obstacle to providing excellent oral care to residents?',
              'What is the biggest obstacle to providing excellent oral care to residents? - Other',
              'What is your primary source of dental-related information?',
              'What is your primary source of dental-related information? - Other',
              'What is the biggest obstacle to receiving proper oral health care training in your facility?',
              'What is the biggest obstacle to receiving proper oral health care training in your facility? - Other',
              'Which resource would be most useful to help improve your confidence in delivering excellent oral homecare to residents under your care?',
              'Which resource would be most useful to help improve your confidence in delivering excellent oral homecare to residents under your care? - Other',
              'RandomID']

df.columns

## Drop unnecessary columns

In [None]:
# drop unnecessary columns
show = df.drop(['Start Date',
                'End Date', 
                #'Response Type', 
                'IP Address', 
                'Progress', 
                #'Duration (seconds)', 
                #'Finished', 
                'Recorded Date',
                'Response ID',
                'Recipient Last Name',
                'Recipient First Name', 
                'Recipient Email', 
                'External Data Reference',
                'Distribution Channel', 
                'User Language'], axis=1)
show.head()

## Drop test responses

In [None]:
# identify the test responses
show[(show['Response Type']=='Survey Preview') | (show['Response Type']=='Survey Test')]

In [None]:
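# the leading rows are test/preview responses and need to be dropped from the data
# NOTE: the slice below drops the first 8 rows (positions 0-7) and resets the index;
# confirm that this matches the rows shown in the cell above
respondents = show[8:].reset_index(drop=True)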
respondents.head()
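
Slicing by position only works while the test rows sit at the top of the export. A minimal alternative sketch, assuming the test rows are exactly those whose `Response Type` is `Survey Preview` or `Survey Test` (the two values checked above), filters on the label instead. The toy frame, the `Some Other Type` value, and the `drop_test_responses` helper are hypothetical and only illustrate the idea; the helper would not remove any other non-response rows that carry a different `Response Type`.

```python
import pandas as pd

# labels treated as test responses (taken from the check above)
TEST_TYPES = ["Survey Preview", "Survey Test"]

def drop_test_responses(frame, column="Response Type"):
    """Return only the rows whose response type is not a test/preview label."""
    is_test = frame[column].isin(TEST_TYPES)
    return frame[~is_test].reset_index(drop=True)

# toy data: values other than the two test labels are made up for illustration
toy = pd.DataFrame({
    "Response Type": ["Survey Preview", "Survey Test", "Some Other Type", "Some Other Type"],
    "Finished": ["True", "True", "True", "False"],
})

kept = drop_test_responses(toy)
print(kept)
```

Filtering by label makes the intent explicit, at the cost of relying on the label values being spelled exactly as in the export.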

In [None]:
# drop `Response Type`
respondents = respondents.drop(['Response Type'], axis=1)
respondents.head()

## Drop incomplete responses

In [None]:
# filter out incomplete responses
completed = respondents[respondents['Finished']=='True'].reset_index(drop=True)
completed

In [None]:
respondents.shape

In [None]:
show.shape

## View distribution of values in each question

In [None]:
# view value counts for each question column
# (location, duration, and completion columns are skipped)
for column in completed.drop(['Location Latitude', 
                              'Location Longitude',
                              'Duration (seconds)',
                              'Finished'], axis=1).columns:
    print("-"*60)
    print(f"COLUMN: '{column}'")
    print(f"UNIQUE VALUES: {len(completed[column].unique())}")
    print("- "*30)
    print(completed[column].value_counts())
    print("-"*60)

### Distribute columns into new categories
1. **DEMOGRAPHICS** - these questions are about the respondents and will not change after training.
2. **SELF REPORTING** - these questions offer insights into the respondents' wants, needs, and personal experiences, but *they will not be impacted by the training.*
3. **HYPOTHESIS** - _**these questions will measure the impact of the training material and support or reject the hypothesis.**_

In [None]:
#distribute columns into new categories: ['demographics', 'hypothesis', 'self_reporting']

demographics = ['ChosenID', 
                'RandomID',
                'Location Latitude', 
                'Location Longitude', 
                'What is your primary language?', 
                'What is your primary language? - Other', 
                'Are you a:', # position/relevant need for training
                'Are you a: - Other',
                'Which of the following best describes your Race/Ethnicity?',
                'Do you identify as:', #gender
                'Which of the following best describes the area you live in?']

hypothesis = ['ChosenID', 
              'RandomID',
              'Fluoridated products, such as fluoridated toothpaste and fluoridated water, can help improve the oral health of residents.',
              'It is normal for healthy gums to bleed when brushing teeth.',
              'Dry mouth can have a negative effect on overall oral health.',
              'Snacking throughout the day can have a negative impact on oral health.',
              'I believe I have effective techniques which I use to brush the teeth of residents under my care.',
              'I believe I have previously received adequate training to help provide the best oral care possible to residents under my care.',
              'I believe residents under my care have oral health care needs which require further training to adequately understand and help manage.',
              'I feel comfortable assisting residents in the safe use of fluoridated dental products (such as fluoridated toothpaste).',
              'I am able to confidently recognize non-verbal signs of pain in residents under my care.',
              'I feel confident that I have the knowledge to identify when residents under my care experience oral pain.',
              'Approximately, what percentage of residents under your care express that they experience pain when brushing their teeth?',
              'Approximately, what percentage of residents under your care experience pain when flossing their teeth?',
              'Approximately, what percentage of residents under your care express that they experience dental pain throughout the day when they are not brushing or flossing their teeth?',
              'How often should residents brush their teeth each day?', 
              'How often should residents floss their teeth each day?']

self_reporting = ['ChosenID', 
                  'RandomID',
                  'I would be interested in receiving additional training to help maintain the oral health of residents under my care.', 
                  'Approximately, what percentage of residents under your care require assistance brushing or flossing their teeth?', 
                  'Approximately, what percentage of residents under your care experience bleeding when brushing their teeth?',
                  'Approximately, what percentage of residents under your care experience bleeding when flossing their teeth?', 
                  'Approximately, how often do residents under your care go to the dentist?', 
                  'Approximately, how often do residents under your care go to the dentist? - Other',
                  'On average, how often do most residents under your care brush their teeth?', 
                  'On average, how often do most residents under your care floss their teeth?',
                  'Approximately, how often do residents under your care have snacks throughout the day between brushing and flossing their teeth?',
                  'What is the biggest obstacle to providing excellent oral care to residents?',
                  'What is the biggest obstacle to providing excellent oral care to residents? - Other',
                  'What is your primary source of dental-related information?',
                  'What is your primary source of dental-related information? - Other',
                  'What is the biggest obstacle to receiving proper oral health care training in your facility?',
                  'What is the biggest obstacle to receiving proper oral health care training in your facility? - Other',
                  'Which resource would be most useful to help improve your confidence in delivering excellent oral homecare to residents under your care?',
                  'Which resource would be most useful to help improve your confidence in delivering excellent oral homecare to residents under your care? - Other',
                  'Before today, I have received training on how to provide direct oral health care for individuals with special health care needs',
                  'Before today, I have received training on how to provide direct oral health care for individuals with special health care needs - Yes. If you answered YES to a previous training, please describe here:']

In [None]:
# make sure I have all the columns:
for i in completed.columns:
    if i not in demographics:
        if i not in hypothesis:
            if i not in self_reporting:
                print(i) # should only output 'Finished' and 'Duration (seconds)'

In [None]:
for i in completed.columns:
    if (
        i not in demographics and 
        i not in hypothesis and
        i not in self_reporting
    ):
        print(i) # should only output 'Finished' and 'Duration (seconds)'
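
The same coverage check can be sketched with sets, which also catches the opposite mistake: a name in one of the category lists that does not exist in the data. The column names below are shortened, hypothetical stand-ins for the real survey questions.

```python
# toy category lists (hypothetical, shortened names)
categories = {
    "demographics": ["ChosenID", "RandomID", "Gender"],
    "hypothesis": ["ChosenID", "RandomID", "Q1"],
    "self_reporting": ["ChosenID", "RandomID", "Q2"],
}
# toy stand-in for completed.columns
all_columns = ["ChosenID", "RandomID", "Gender", "Q1", "Q2",
               "Finished", "Duration (seconds)"]

# every column name that appears in at least one category
assigned = set().union(*categories.values())

# columns in the data that no category claims
unassigned = [c for c in all_columns if c not in assigned]
# names in a category list that are missing from the data (e.g. typos)
unknown = sorted(assigned - set(all_columns))

print(unassigned)  # ['Finished', 'Duration (seconds)']
print(unknown)     # []
```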

## Minutes to complete survey

In [None]:
completed['Duration (seconds)'] = completed['Duration (seconds)'].astype(int)

time = completed[['Duration (seconds)']].copy()  # copy so the new column is not added to a slice
time['mins'] = time['Duration (seconds)'] / 60
time

In [None]:
# pickle time df
pd.to_pickle(time, "../saved_data_frames/time_df.pkl")

## Set up visualization of completed surveys

In [None]:
print(f'Total: {len(respondents)}')
print(f'Completed: {len(completed)}')

In [None]:
totals = pd.DataFrame([len(respondents),len(completed)], columns=['count'], index=['Total', 'Completed'])
totals

In [None]:
# pickle totals df
pd.to_pickle(totals, "../saved_data_frames/totals_df.pkl")

## DEMOGRAPHICS subdivision

### Create new demographics df

In [None]:
demo_df = completed[demographics].copy()  # copy so later column edits don't act on a slice of `completed`
demo_df

### Simplify column names

In [None]:
demo_df = demo_df.rename(columns={'Which of the following best describes the area you live in?': 'Community Type',
                                  'Location Latitude':'Lat',
                                  'Location Longitude':'Long',
                                  'Which of the following best describes your Race/Ethnicity?':'Race/Ethnicity',
                                  'Do you identify as:':'Gender'})

In [None]:
demo_df

### Questions with paired Other responses

In [None]:
demo_df[['What is your primary language?',
          'What is your primary language? - Other',
          'Are you a:',
          'Are you a: - Other',]]

#### Create `Primary Language` column

In [None]:
demo_df['What is your primary language? - Other'].value_counts()

In [None]:
demo_df['What is your primary language?'].value_counts()

In [None]:
primary_language = []
for i in enumerate(completed['What is your primary language?']):
    #print(i)# i is tuple (index, value)
    if i[1] == 'Other (please specify):':
        primary_language.append(completed['What is your primary language? - Other'][i[0]].capitalize())
    else: 
        primary_language.append(i[1].capitalize())
        
set(primary_language)

In [None]:
demo_df['Primary Language'] = primary_language
demo_df.drop(['What is your primary language? - Other',
              'What is your primary language?'], 
             axis=1, 
             inplace=True)
demo_df
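
The loop above can also be written without explicit iteration. The following is a sketch only, using `Series.where` on a toy frame with hypothetical column names and values; it additionally fills a missing write-in before capitalizing, which the loop above does not do.

```python
import pandas as pd

# toy data: one respondent chose "Other" and wrote in a language
toy = pd.DataFrame({
    "language": ["English", "Other (please specify):", "spanish"],
    "language_other": [None, "french", None],
})

# True where the write-in column should be used
use_other = toy["language"] == "Other (please specify):"

# keep the chosen answer unless "Other" was selected, then take the write-in
merged = toy["language"].where(~use_other, toy["language_other"])

# guard against an empty write-in, then normalize capitalization
merged = merged.fillna("Not specified").str.capitalize()

print(merged.tolist())  # ['English', 'French', 'Spanish']
```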

#### Create `Training Relevance` column

In [None]:
demo_df['Are you a:'].value_counts()

In [None]:
demo_df['Are you a: - Other'].value_counts()

In [None]:
positions = []
for i in enumerate(demo_df['Are you a:']):
    #print(i)# i is tuple (index, value)
    if i[1] == 'Other (please specify):':
        positions.append(demo_df['Are you a: - Other'][i[0]])
    else: 
        positions.append(i[1])
        
set(positions)

In [None]:
for idx, value in enumerate(positions):
    # NaN never compares equal to anything (including np.nan), so test it with pd.isna
    if pd.isna(value) or value in ('None', 'none', 'no'):
        positions[idx] = 'Not Specified'
set(positions)
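
A NaN value never compares equal to anything, including itself, so an equality test against `np.nan` cannot detect a missing response. The short sketch below, on a made-up list of values, shows the difference and uses `pd.isna` instead.

```python
import numpy as np
import pandas as pd

# toy responses; "Nurse" is a made-up value for illustration
values = ["Nurse", "none", np.nan, "no"]

print(np.nan == np.nan)  # False
print(pd.isna(np.nan))   # True

# strings that should be treated as "no answer"
placeholders = {"None", "none", "no"}

cleaned = [
    "Not Specified" if pd.isna(v) or v in placeholders else v
    for v in values
]
print(cleaned)  # ['Nurse', 'Not Specified', 'Not Specified', 'Not Specified']
```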

In [None]:
demo_df['Training Relevance'] = positions
demo_df.drop(['Are you a: - Other',
              'Are you a:'], 
             axis=1, 
             inplace=True)

demo_df

In [None]:
demo_df['Training Relevance'].isnull().sum()

In [None]:
demo_df['Training Relevance'] = demo_df['Training Relevance'].fillna("Not Specified")

In [None]:
demo_df

In [None]:
demo_df['Training Relevance'].value_counts()

In [None]:
demo_df['Training Relevance'] = [i.capitalize() for i in demo_df['Training Relevance']]

In [None]:
demo_df['Training Relevance'].isnull().sum()

### Save demographics data

In [None]:
# create separate lat/long data, not associated with responses
demo_df['lat,long'] = list(zip(demo_df['Lat'], demo_df['Long']))

lat_long = pd.DataFrame(demo_df['lat,long'].value_counts()).reset_index()
lat_long.columns = ['(lat,long)','count']
pd.to_pickle(lat_long, "../saved_data_frames/lat_long_df.pkl")

In [None]:
# finalize demographics df and pickle
demographics_df = demo_df.drop(['Lat', 'Long', 'lat,long'], axis=1)
pd.to_pickle(demographics_df, "../saved_data_frames/demographics_df.pkl")

## SELF_REPORTING subdivision

### Create new self-reporting df

In [None]:
sr_df = completed[self_reporting]
sr_df.columns

In [None]:
# move the questions out of the column names into a row so columns can be referenced by integer position
sr_easy = sr_df.T.reset_index().T
sr_easy.head(3)
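
To make the transposition trick concrete, here is a small sketch on a toy frame with hypothetical question names. It moves the header into a row labelled `index`, leaving integer column labels, and then restores the header from that row.

```python
import pandas as pd

# toy frame with long, question-style column names (hypothetical)
toy = pd.DataFrame({"Question A?": ["yes", "no"], "Question B?": ["1", "2"]})

# transpose, turn the question names into a regular column, transpose back:
# the questions end up in a row labelled 'index', and columns become 0, 1, ...
easy = toy.T.reset_index().T
print(easy)

# restore: promote the first row to the header and drop it from the data
restored = easy[1:].copy()
restored.columns = easy.iloc[0]
print(restored)
```

Working with integer column labels keeps the cleaning cells short; the original question text is recovered at the end.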

View the questions in this section:

In [None]:
for question in enumerate(sr_easy.loc['index']):
    print(question)

### Questions with paired Other responses

Extract just the questions with a paired "Other" response.

Paired columns are listed below:
- 6 & 7: Approximately, how often do residents under your care go to the dentist?
- 11 & 12: What is the biggest obstacle to providing excellent oral care to residents?
- 13 & 14: What is your primary source of dental-related information?
- 15 & 16: What is the biggest obstacle to receiving proper oral health care training in your facility?
- 17 & 18: Which resource would be most useful to help improve your confidence in delivering excellent oral homecare to residents under your care?
- 19 & 20: Before today, I have received training on how to provide direct oral health care for individuals with special health care needs. (20, if Yes please describe)
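
The pairs above were identified by reading the enumerated questions. As a hedged sketch, most of them could also be located programmatically by looking for the ` - Other` suffix; the column names below are shortened, hypothetical examples. A pair whose second column uses a different suffix (such as the 19 & 20 pair) would not be found this way and still needs to be added by hand.

```python
# toy list of question names (hypothetical, shortened)
columns = [
    "ChosenID",
    "How often do residents go to the dentist?",
    "How often do residents go to the dentist? - Other",
    "How often do residents brush?",
]

suffix = " - Other"

# (position of base question, position of its write-in column)
pairs = [
    (columns.index(name[: -len(suffix)]), position)
    for position, name in enumerate(columns)
    if name.endswith(suffix) and name[: -len(suffix)] in columns
]

print(pairs)  # [(1, 2)]
```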

#### Consolidate Function

In [None]:
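def consolidate(df, col1, col2, Other_string, convert_dict):
    """Merge a question column with its paired write-in ("Other") column.

    Where `col1` equals `Other_string`, the value from `col2` is used instead.
    The merged values go into a new column labelled with the next integer
    after the last column, are recoded with `convert_dict`, and the two
    original columns are dropped.
    """
    # work on a copy so the caller's frame is not modified in place
    df = df.copy()

    # fill null values in the write-in column
    df[col2] = df[col2].fillna("Not Specified")

    # empty list to hold new column values
    new_col_values = []

    # consolidate into the list, preferring the write-in when "Other" was chosen
    for i in df.index:
        response = df.loc[i, col1]
        if response == Other_string:
            new_col_values.append(df.loc[i, col2])
        else:
            new_col_values.append(response)

    # add new column (column labels are integers in `sr_easy`)
    new_col = df.columns[-1] + 1
    df[new_col] = new_col_values

    # convert free-text responses with dict
    df[new_col] = df[new_col].replace(convert_dict)

    # drop original columns
    df = df.drop([col1, col2], axis=1)

    return df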
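
To illustrate what the consolidation step is meant to produce, here is a sketch on a toy frame. `consolidate_sketch` is a hypothetical, vectorized stand-in written for this example, not the function defined above, and the toy IDs are made up.

```python
import pandas as pd

def consolidate_sketch(frame, col1, col2, other_string, convert_dict):
    """Merge a choice column with its write-in column (hypothetical stand-in)."""
    frame = frame.copy()
    write_in = frame[col2].fillna("Not Specified")
    is_other = frame[col1] == other_string
    # next integer label after the last column
    new_col = frame.columns[-1] + 1
    # keep the chosen answer unless "Other" was selected, then recode free text
    frame[new_col] = frame[col1].where(~is_other, write_in).replace(convert_dict)
    return frame.drop([col1, col2], axis=1)

# toy frame with integer column labels, as in `sr_easy`
toy = pd.DataFrame({
    0: ["id1", "id2", "id3"],
    1: ["Once every 6 months", "Other (please describe):", "Other (please describe):"],
    2: [None, "not sure", None],
})

result = consolidate_sketch(toy, 1, 2, "Other (please describe):", {"not sure": "Unknown"})
print(result)
```

In the result, columns 1 and 2 are replaced by a single column 3 holding the chosen answer, the recoded write-in, or "Not Specified" when "Other" was chosen without a write-in.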

#### Consolidate 6 & 7 responses: "Approximately, how often do residents under your care go to the dentist?"  

In [None]:
sr_easy[6].value_counts()

In [None]:
sr_easy[7].value_counts()

In [None]:
convert = {"not sure": "Unknown", 
           "unknown": "Unknown",
           "individuals under my care are often very non-compliant": "Difficult due to non-compliance",
           'scheduled dental van visits':"Scheduled dental van visits",
           "depends on how much dental work needs to done so 3 to 6 months ": "Once every 6 months",
           "Never, kids are under 5 years old. Parents have not taken them or have not found a dentist for them.":"N/A, children are too young",
           "it is supposed to be 4-6 months but it onthe dental van hard to get a spot ":"Scheduled dental van visits",
           "i dont remember once a year at least":"Once every 12 months",
           "dental care for the students in my care don't happen nearly as often as it should. More than 2 years will go by without them being seen even if it's just for a cleaning":"Less than once every 12 months",
           "Due to lack of providers, some residents have not seen the dentist in over a year despite needing to":"Less than once every 12 months",
           "Not certain. Between every 6 months and 1x a year":"Once every 12 months",
           "Don't know.": "Unknown",
           "I don't have this information":"Unknown",
           "I am not sure, I work per diem":"Unknown",
           "Not aware of that":"Unknown"}

In [None]:
sr_easy = consolidate(sr_easy, 6, 7, 'Other (please describe):', convert)

In [None]:
sr_easy[21].value_counts()

#### Consolidate 11 & 12 responses: "What is the biggest obstacle to providing excellent oral care to residents?"  

Combine the paired responses from the Other column into a new column.

In [None]:
sr_easy[11].value_counts()

In [None]:
sr_easy[12].value_counts()

In [None]:
convert_2 = {" ?":"Not Specified",
             "non-compliance":"Residents' specific behavioral needs",
             "none":"Not Specified",
             "student not wanting to brush there teeth sometimes you can and sometimes you cant and also the student not knowing how to spit out the toothpaste ":"Other",
             "some staff just lazy and dirty, if they dont brush there own teeth do you think they are going to brush our individuals teeth?":"Other",
             "Financial restrictions which limit access to proper oral health care":"Financial restrictions"
            }

In [None]:
# save Other responses
other_barriers = [
    
    "Student's not wanting to brush their teeth. Sometimes you can and sometimes you can't. Also, the student not knowing how to spit out the toothpaste.",
    "Some staff are just lazy and dirty. If they don't brush their own teeth, do you think they are going to brush our individuals' teeth?"
]

In [None]:
sr_easy = consolidate(sr_easy, 11, 12, 'Other (please describe):', convert_2)

In [None]:
sr_easy[sr_easy.columns[-1]].value_counts()

#### Consolidate 13 & 14 responses: "What is your primary source of dental-related information?"

In [None]:
sr_easy[13].value_counts()

In [None]:
sr_easy[14].value_counts()

In [None]:
convert_3 = {"none":"Not Specified",
             "individuals's plan of care":"Individuals's plan of care",
             "Previous healthcare training p":"Previous healthcare training",
             "agency and nursing support":"Agency and nursing support",
             "Dental professionals such as dentist, dental hygienist, dental assistants" : "Dental professionals",
             "Internet and social media sites such as Google, YouTube, Twitter, Facebook, etc.":"Internet and social media",
             "Academic sources such as research papers and research journal articles":"Academic sources"}

In [None]:
sr_easy = consolidate(sr_easy, 13, 14, 'Other (please describe):', convert_3)

In [None]:
sr_easy[sr_easy.columns[-1]].value_counts()

#### Consolidate 15 & 16 responses: "What is the biggest obstacle to receiving proper oral health care training in your facility?"  

In [None]:
sr_easy[15].value_counts()

In [None]:
sr_easy[16].value_counts()

In [None]:
convert_4 = {"not residential staff": "N/A - Not a residential staff",
             "not sure":"Not Specified",
             "none":"There are no obstacles to receiving proper oral health care training in my facility",
             "uncertain if any is provided":"Lack of resources for teaching proper oral health care training",
             "na":"Not Specified",
             "individual's non-compliance":"The individual's non-compliance"}

In [None]:
sr_easy = consolidate(sr_easy, 15, 16, 'Other (please describe):', convert_4)

In [None]:
sr_easy[sr_easy.columns[-1]].value_counts()

#### Consolidate 17 & 18 responses: "Which resource would be most useful to help improve your confidence in delivering excellent oral homecare to residents under your care?"

In [None]:
sr_easy[17].value_counts()

In [None]:
sr_easy[18].value_counts()

In [None]:
convert_5 = {"not sure":"Not Specified"}

In [None]:
sr_easy = consolidate(sr_easy, 17, 18, 'Other (please explain):', convert_5)

In [None]:
sr_easy[sr_easy.columns[-1]].value_counts()

#### Consolidate 19 & 20 responses: "Before today, I have received training on how to provide direct oral health care for individuals with special health care needs."

In [None]:
sr_easy[19].value_counts()

In [None]:
# assign the result back; an in-place fillna on a selected column may not update the frame
sr_easy[20] = sr_easy[20].fillna("Yes - Not Specified")

In [None]:
sr_easy[20].value_counts()

In [None]:
convert_6 = {"Yes": "Yes - Not Specified" ,
             
             "1998 at a former employer": "Yes - Current or previous employment" ,
             
             "very good ": "Yes - Not Specified" ,
             
             "training with anderson": "Yes - Current or previous employment",
             
             "Nursing at AEC": "Yes - Current or previous employment",
             
             "on site training": "Yes - Current or previous employment",
             
             "staff":"Yes - Current or previous employment",
             
             "N/a": "Yes - Not Specified",
             
             "Worked in healthcare": "Yes - Current or previous employment",
             
             "ACA":"Yes - Not Specified",
             
             "past employment":"Yes - Current or previous employment",
             
             "training from nursing " : "Yes - Current or previous employment",
             
             "at on-the-job training ":"Yes - Current or previous employment",
             
             "no":"No",
             
             "activities of daily living":"Yes - Not Specified",
             
             "Yes I have": "Yes - Not Specified" ,
             
             "In person/Task Analysis ":"Yes - Current or previous employment"}

In [None]:
sr_easy = consolidate(sr_easy, 19, 20, 
                      'Yes. If you answered YES to a previous training, please describe here:', 
                      convert_6)

In [None]:
sr_easy[sr_easy.columns[-1]].value_counts()

### Reset column names

In [None]:
new_header = sr_easy.iloc[0] # grab the first row for the header
self_reporting = sr_easy[1:].copy() # take the data less the header row (copied so the slice can be modified safely)
self_reporting.columns = new_header # set the header row as the df header

In [None]:
self_reporting.head()

### Save self-reporting data

In [None]:
pd.to_pickle(self_reporting, "../saved_data_frames/self_reporting_df.pkl")

## HYPOTHESIS subdivision

### Create new hypothesis df

In [None]:
hypothesis

In [None]:
hypo_df = completed[hypothesis]

In [None]:
list(enumerate(hypo_df.columns))

#### Divide into KNOWLEDGE and ATTITUDE

The bracketed numbers are column positions in `hypo_df`, as enumerated above.

**KNOWLEDGE**: [2,3,4,5,15,16,12,13,14]

These questions have correct responses:
- Fluoridated products, such as fluoridated toothpaste and fluoridated water, can help improve the oral health of residents.
- It is normal for healthy gums to bleed when brushing teeth.
- Dry mouth can have a negative effect on overall oral health.
- Snacking throughout the day can have a negative impact on oral health.
- How often should residents brush their teeth each day?
- How often should residents floss their teeth each day?

These questions are ordinal with no correct response
- Approximately, what percentage of residents under your care express that they experience pain when brushing their teeth?
- Approximately, what percentage of residents under your care experience pain when flossing their teeth?
- Approximately, what percentage of residents under your care express that they experience dental pain throughout the day when they are not brushing or flossing their teeth?

**ATTITUDE**: [6,7,8,9,11,10]
- I believe I have effective techniques which I use to brush the teeth of residents under my care.
- I believe I have previously received adequate training to help provide the best oral care possible to residents under my care.
- I believe residents under my care have oral health care needs which require further training to adequately understand and help manage.
- I feel comfortable assisting residents in the safe use of fluoridated dental products (such as fluoridated toothpaste).
- I feel confident that I have the knowledge to identify when residents under my care experience oral pain.
- I am able to confidently recognize non-verbal signs of pain in residents under my care.
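
The position lists above could be turned into two frames with `iloc`. This is a sketch under stated assumptions: the toy frame stands in for `hypo_df` with 17 placeholder columns, and the names `knowledge_df` and `attitude_df` are hypothetical.

```python
import pandas as pd

# column positions from the lists above
knowledge_idx = [2, 3, 4, 5, 15, 16, 12, 13, 14]
attitude_idx = [6, 7, 8, 9, 11, 10]
id_idx = [0, 1]  # ChosenID and RandomID

# toy stand-in for hypo_df: two ID columns plus 15 placeholder questions
toy = pd.DataFrame(
    [[f"r{r}c{c}" for c in range(17)] for r in range(2)],
    columns=["ChosenID", "RandomID"] + [f"Q{c}" for c in range(2, 17)],
)

# every non-ID column should land in exactly one group
assert sorted(knowledge_idx + attitude_idx) == list(range(2, 17))

# select by position, keeping the ID columns in both frames
knowledge_df = toy.iloc[:, id_idx + knowledge_idx]
attitude_df = toy.iloc[:, id_idx + attitude_idx]

print(knowledge_df.shape)  # (2, 11)
print(attitude_df.shape)   # (2, 8)
```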

### Save HYPOTHESIS data

In [None]:
pd.to_pickle(hypo_df, "../saved_data_frames/hypothesis_df.pkl")

# END

In [None]:
# save the cleaned pre training data

In [None]:
# load all pre-training data
demo_df = pd.read_pickle("../saved_data_frames/demographics_df.pkl").drop(["ChosenID","RandomID"], axis=1)
self_reporting = pd.read_pickle("../saved_data_frames/self_reporting_df.pkl")
eval_df = pd.read_pickle("../saved_data_frames/hypothesis_df.pkl").drop(["ChosenID","RandomID"], axis=1)

cleaned_pre = pd.concat([self_reporting,demo_df,eval_df], axis=1)
pd.to_pickle(cleaned_pre, "../saved_data_frames/cleaned_pre.pkl")
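
`pd.concat(..., axis=1)` lines rows up by index label, not by position, so the three frames need matching indexes for the combined rows to belong to the same respondent. The toy sketch below (made-up data) shows what happens when the indexes differ and one way to check for it.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]})                           # index 0, 1, 2
shifted = pd.DataFrame({"b": ["x", "y", "z"]}, index=[1, 2, 3])  # index 1, 2, 3

# labels 0 and 3 each appear in only one frame, so the result has 4 rows with gaps
misaligned = pd.concat([left, shifted], axis=1)
print(misaligned.shape)               # (4, 2)
print(int(misaligned.isna().sum().sum()))  # 2

# resetting the index first makes the rows line up by position
aligned = pd.concat([left, shifted.reset_index(drop=True)], axis=1)
print(aligned.shape)                  # (3, 2)
```

Comparing the row count of the combined frame with the row count of each input is a quick way to spot this kind of misalignment.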