# Checkpoint Two: Exploratory Data Analysis

Now that your chosen dataset is approved, it is time to start working on your analysis. Use this notebook to perform your EDA and make notes where directed as you work.

## Getting Started

Since we have not provided your dataset for you, you will need to load the necessary files in this repository. Make sure to include a link back to the original dataset here as well.

My dataset: https://catalog.data.gov/dataset/alzheimers-disease-and-healthy-aging-data

Your first task in EDA is to import the necessary libraries and create one or more dataframes. Use code comments to note your thought process as you work on this setup task.

In [None]:
# I pre-filtered the full Excel file down to the two survey questions I want to analyze,
# so the file imported here is already a filtered subset of the main dataset.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

df = pd.read_csv("./filtered-dataset.csv")

# Looks like 0 duplicated rows.
# print() is needed because only the last expression in a cell is displayed.
print(df.duplicated().sum())

df.columns

# Full list of columns, kept for reference.
# Named all_columns so it does not shadow the built-in `list`.
all_columns = ['RowId', 'YearStart', 'YearEnd', 'LocationAbbr', 'LocationDesc',
       'Datasource', 'Class', 'Topic', 'Question', 'Data_Value_Unit',
       'DataValueTypeID', 'Data_Value_Type', 'Data_Value', 'Data_Value_Alt',
       'Data_Value_Footnote_Symbol', 'Data_Value_Footnote',
       'Low_Confidence_Limit', 'High_Confidence_Limit',
       'StratificationCategory1', 'Stratification1', 'StratificationCategory2',
       'Stratification2', 'Geolocation', 'ClassID', 'TopicID', 'QuestionID',
       'LocationID', 'StratificationCategoryID1', 'StratificationID1',
       'StratificationCategoryID2', 'StratificationID2']

In [39]:
# Short aliases for the column names used throughout this notebook
column_dict = {
    "question": "Question",
    "location_abbrev": "LocationAbbr",
    "location_desc": "LocationDesc",
    "strat_cat_1": "StratificationCategory1",
    "strat_1_value": "Stratification1",
    "strat_cat_2": "StratificationCategory2",
    "strat_2_value": "Stratification2",
    "df_footnote": "Data_Value_Footnote",
    "id": "RowId",
    "dv_type": "Data_Value_Type",
}

## Get to Know the Numbers

Now that you have everything set up, put any code that you use to get to know the dataframe and its rows and columns better in the cell below. You can use whatever techniques you like, except for visualizations. You will put those in a separate section.

When working on your code, make sure to leave comments so that your mentors can understand your thought process.

In [None]:
# Is each row of my dataset an individual data point, or does it represent an aggregation (average, sum, etc.) of a dataset?
# Answer: each row represents an aggregation of survey responses for a given geographic region.


In [None]:
## Unique locations
## Are there regions mixed in with specific states in the data? -- YES

unique_vals = df[column_dict['location_desc']].unique()
unique_vals.sort()

print(unique_vals)
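
Since regional aggregates are mixed in with individual states, one option is to split them into separate dataframes before comparing values. The following is a minimal sketch on a toy dataframe; the `region_names` set is a hypothetical placeholder and should be replaced with the region labels that actually appear in the printed list.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the values are illustrative only.
toy = pd.DataFrame({
    "LocationDesc": ["Ohio", "Midwest", "Texas", "South", "Ohio"],
    "Data_Value": [10.5, 11.2, 9.8, 10.1, 12.0],
})

# Hypothetical region labels (assumption): swap in the ones seen in the real data.
region_names = {"Midwest", "South"}

# Boolean mask that is True for rows describing a region rather than a state
is_region = toy["LocationDesc"].isin(region_names)

regions_df = toy[is_region]
states_df = toy[~is_region]

print(sorted(regions_df["LocationDesc"].unique()))
print(sorted(states_df["LocationDesc"].unique()))
```

Keeping the two groups apart avoids double counting, because a regional row already aggregates the states it contains.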


In [None]:
## Unique age groups reported on for these two questions
category = "Age Group"

filtered_to_cat = df[df[column_dict['strat_cat_1']] == category]

filtered_to_cat[column_dict['strat_1_value']].unique()



In [None]:
## Unique Race/Ethnicity values reported on for these two questions

category = 'Race/Ethnicity'

filtered_to_cat = df[df[column_dict['strat_cat_2']] == category]

filtered_to_cat[column_dict['strat_2_value']].unique()

In [None]:
## Unique Gender reported on for these two questions
category = 'Gender'

filtered_to_cat = df[df[column_dict['strat_cat_2']] == category]

filtered_to_cat[column_dict['strat_2_value']].unique()
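
The three cells above repeat the same filter-then-unique pattern. As a possible shortcut, a `groupby` can list the unique values for every category in one pass. This is a sketch on a toy dataframe with made-up rows; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the rows are illustrative only.
toy = pd.DataFrame({
    "StratificationCategory2": ["Gender", "Gender", "Race/Ethnicity", "Race/Ethnicity", "Gender"],
    "Stratification2": ["Male", "Female", "Hispanic", "Asian/Pacific Islander", "Male"],
})

# For each category, collect its distinct values and sort them for readability
unique_by_category = (
    toy.groupby("StratificationCategory2")["Stratification2"]
    .unique()
    .apply(sorted)
)

for category, values in unique_by_category.items():
    print(category, values)
```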

In [28]:
## Number of 'No Data Available' rows in the dataset for these questions

row_val = "No Data Available"

count_no_data = df[df[column_dict['df_footnote']] == row_val][column_dict['df_footnote']].count()

In [43]:
## Number of rows with no data due to sample size warning
row_val = "Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%"

count_sample_warning = df[df[column_dict['df_footnote']] == row_val][column_dict['df_footnote']].count()


total_no_data = count_no_data + count_sample_warning
print(total_no_data)

num_total_records = df[column_dict['id']].count()
print(num_total_records)

percent_missing = round((total_no_data / num_total_records) * 100, 2)
print(percent_missing)

# Total rows with data
print(num_total_records - total_no_data)
# When I move on to cleaning the data, I will filter these no-data rows out of the dataset.

6095
16287
37.42
10192
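
As a cross-check on the counts above, the footnotes can be tallied in one call with `value_counts`, and the missing share can be computed directly from `Data_Value`. This sketch uses a toy dataframe and assumes that rows carrying a no-data footnote also have a missing (`NaN`) `Data_Value`; that assumption should be verified against the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; the footnote text is shortened for illustration.
toy = pd.DataFrame({
    "Data_Value": [12.1, np.nan, np.nan, 8.4, np.nan],
    "Data_Value_Footnote": [
        np.nan,
        "No Data Available",
        "Sample size warning",
        np.nan,
        "No Data Available",
    ],
})

# Count every distinct footnote, including rows that have no footnote at all
footnote_counts = toy["Data_Value_Footnote"].value_counts(dropna=False)
print(footnote_counts)

# Share of rows whose Data_Value is missing, as a percentage
percent_missing_values = round(toy["Data_Value"].isna().mean() * 100, 2)
print(percent_missing_values)
```

If the two approaches disagree on the real data, some rows are missing a value without a footnote, or carry a footnote alongside a value.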


In [40]:
# Value data types in dataset

df[column_dict['dv_type']].unique()

array(['Percentage'], dtype=object)

In [None]:
## <Important> Need to research what these High and Low confidence values mean

## A confidence interval is a range of values that is likely to contain the true population value.
## The low confidence limit (lower bound) is the lowest value in that range,
## while the high confidence limit (upper bound) is the highest value.
## Together, these limits express the uncertainty around the reported Data_Value:
## the wider the interval, the less precise the estimate.
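
To make the confidence limits concrete, the interval width can be computed per row as the high limit minus the low limit. The sketch below uses a toy dataframe with invented numbers; the column names `Data_Value`, `Low_Confidence_Limit`, and `High_Confidence_Limit` are taken from the dataset's column list.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the numbers are invented for illustration.
toy = pd.DataFrame({
    "Data_Value": [30.0, 12.5],
    "Low_Confidence_Limit": [25.0, 12.0],
    "High_Confidence_Limit": [35.0, 13.0],
})

# Width of the interval: a wider interval means a less precise estimate
toy["ci_width"] = toy["High_Confidence_Limit"] - toy["Low_Confidence_Limit"]

# Sanity check: the reported value should sit between its two limits
toy["estimate_inside_ci"] = toy["Data_Value"].between(
    toy["Low_Confidence_Limit"], toy["High_Confidence_Limit"]
)

print(toy)
```

Sorting the real data by `ci_width` would be one way to spot the estimates that deserve the most caution.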

In [51]:
## Unique Location values for rows with data and rows without data

row_val_1 = "No Data Available"
row_val_2 = "Sample size of denominator and/or age group for age-standardization is less than 50 or relative standard error is more than 30%"

exclude = [row_val_1, row_val_2]


filtered_df = df[~df[column_dict['df_footnote']].isin(exclude)]
# Locations with data
locations_with_data = filtered_df[column_dict['location_desc']].unique()

no_data_df = df[df[column_dict['df_footnote']].isin(exclude)]
locations_without_data = no_data_df[column_dict['location_desc']].unique()


print(len(locations_with_data))
print(len(locations_without_data))

# Both counts are 59, so every unique location seems to have rows with data and rows without data


59
59
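
Equal counts do not by themselves prove that the two groups contain the same locations. A set comparison would confirm it. The sketch below uses short hypothetical lists in place of the arrays returned by `unique()`.

```python
# Hypothetical stand-ins for the arrays returned by .unique() in the cell above
locations_with_data = ["Ohio", "Texas", "Midwest"]
locations_without_data = ["Ohio", "Texas", "South"]

with_set = set(locations_with_data)
without_set = set(locations_without_data)

# Locations that appear in only one of the two groups
only_with_data = with_set - without_set
only_without_data = without_set - with_set

print(with_set == without_set)
print(only_with_data)
print(only_without_data)
```

On the real data, two empty difference sets would confirm that all 59 locations have both kinds of rows.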


In [None]:
## TODO: See how complete the dataset is for each of the two questions. Does one have more data than the other?
## If so, that would be an important thing to note in the final Tableau report.
## How many rows with data does each question have? Can I compare them, or is one question answered much more than the other?
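
One possible way to answer this is to group by question and compare the total row count with the count of rows that have a value. This is a sketch on a toy dataframe with placeholder question labels; it assumes that no-data rows have a missing (`NaN`) `Data_Value`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; question labels are placeholders.
toy = pd.DataFrame({
    "Question": ["Question A", "Question A", "Question A", "Question B", "Question B"],
    "Data_Value": [10.0, np.nan, 12.0, np.nan, np.nan],
})

# "size" counts all rows per question; "count" skips rows where Data_Value is NaN
completeness = toy.groupby("Question")["Data_Value"].agg(
    total_rows="size",
    rows_with_data="count",
)
completeness["percent_with_data"] = (
    completeness["rows_with_data"] / completeness["total_rows"] * 100
).round(2)

print(completeness)
```

A large gap in `percent_with_data` between the two questions would be worth calling out in the Tableau report.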



## Visualize

Create any visualizations for your EDA here. Make note in the form of code comments of what your thought process is for your visualizations.

## Summarize Your Results

With your EDA complete, answer the following questions.

1. Was there anything surprising about your dataset? 
2. Do you have any concerns about your dataset? 
3. Is there anything you want to make note of for the next phase of your analysis, which is cleaning data? 