## Author: Ben Cliff

# About the data  
##### The dataset loaded into this notebook is a .csv file obtained from the CDC's preventative measures site.  
##### The sample subject of this dataset is U.S. citizens who are 65 years or older. The sample is subdivided by gender, state and city.
##### The target or signal of this dataset is the percentage of citizens (broken down by gender and location) who are reported to have taken preventative measures such as immunizations and select cancer screenings against common illnesses.
##### [Bethlehem insert commentary here about what trends we want to see]  

# Purpose of this notebook:
##### Create a clean .csv file to work from for further analysis
##### Perform preliminary analysis as well as diagnostics of the data
##### Create visualizations from this notebook for upload onto our blog site

# Reading and cleaning the dataset

In [None]:
# Importing powerful data manipulation library
import pandas as pd
import numpy as np
import geopandas as gpd
from geopandas import GeoDataFrame
# import shapely
from shapely.geometry import Point

In [None]:
# Reading in prevention data
prevent_df = pd.read_csv('../data/preventativedata.csv')

# Removing columns with little to no information
prevent_df = prevent_df.drop(columns=['Data_Value_Unit', 'Data_Value_Footnote_Symbol', 'Data_Value_Footnote', 'TractFIPS', 'CategoryID', 'StateDesc', 'Data_Value_Type', 'DataSource',
'DataValueTypeID', 'Category'])

In [None]:
# Removing underscores from the column names
prevent_df.columns = prevent_df.columns.str.replace('_', '')
# prevent_df.columns

In [None]:
# Sorting values by unique ID; creating new index and dropping the old one
prevent_df.sort_values(by=['UniqueID'], ascending=True).reset_index(drop=True).head(3)

In [None]:
# Stripping the parentheses from GeoLocation, then splitting each entry into [latitude, longitude] strings
prevent_df['GeoLocation'] = [x.replace('(', '') for x in prevent_df['GeoLocation']]
prevent_df['GeoLocation'] = [x.replace(')', '') for x in prevent_df['GeoLocation']]
prevent_df['GeoLocation'] = [x.split(',') for x in prevent_df['GeoLocation']]


In [None]:
# Converting the split coordinate strings to floats
lats = [float(x[0]) for x in prevent_df['GeoLocation']]
longs = [float(x[1]) for x in prevent_df['GeoLocation']]

In [None]:
prevent_df['lats'] = lats
prevent_df['longs'] = longs
# prevent_df.head(5)
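
The coordinate parsing above can also be done in one vectorized step. The following is a minimal sketch, assuming the `GeoLocation` strings look like `(lat, lon)`; the `sample` frame and its values are made up for illustration and are not taken from the CDC file.

```python
import pandas as pd

# Hypothetical rows mimicking the "(lat, lon)" strings in the GeoLocation column
sample = pd.DataFrame({'GeoLocation': ['(33.52, -86.80)', '(61.21, -149.90)']})

# Capture the two signed decimal numbers inside the parentheses and convert them to floats
pattern = r'\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)'
coords = sample['GeoLocation'].str.extract(pattern).astype(float)

# str.extract returns one column per capture group, labelled 0 and 1
sample['lats'] = coords[0]
sample['longs'] = coords[1]
print(sample)
```

Because the original column is left untouched, this version can be re-run safely, whereas the list-based cells fail on a second run once `GeoLocation` already holds lists.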

# Data Visualization and Analysis

In [None]:
# Importing visualization tool
import plotly.graph_objects as go
import plotly.express as px

In [None]:
# Removing thousands separators so PopulationCount can be cast to an integer
prevent_df['PopulationCount'] = [x.replace(',', '') for x in prevent_df['PopulationCount']]
prevent_df['PopulationCount'] = prevent_df['PopulationCount'].astype(int)

In [None]:
# Adding new column to prevent_df: top 5 states for geospatial analysis
temp_list = []
for x in prevent_df['StateAbbr']:
    if x == 'CA' or x == 'TX' or x == 'FL' or x == 'IL' or x == 'MI':
        temp_list.append('yes')
    else:
        temp_list.append('no')

In [None]:
prevent_df['top5state'] = temp_list
# prevent_df.head(10)
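
The yes/no flag built by the loop above can also be written without an explicit loop. This is a small sketch on a made-up `sample` frame; `TOP_5` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The five states flagged in the loop above (TOP_5 is a hypothetical helper name)
TOP_5 = ['CA', 'TX', 'FL', 'IL', 'MI']

# Hypothetical rows standing in for prevent_df
sample = pd.DataFrame({'StateAbbr': ['CA', 'NY', 'TX', 'OH']})

# isin builds a boolean mask; np.where maps True/False to 'yes'/'no'
sample['top5state'] = np.where(sample['StateAbbr'].isin(TOP_5), 'yes', 'no')
print(sample)
```

The same `isin` mask (and its negation with `~`) could replace the chained `==` and `!=` comparisons used to split the states into two dataframes.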

In [None]:
# Splitting the dataset into male (COREM) and female (COREW) records; the male records are used to get the unique counts of surveyed cities
men_df = prevent_df.loc[prevent_df['MeasureId'] == 'COREM']
women_df = prevent_df.loc[prevent_df['MeasureId'] == 'COREW']

# Isolating the top 5 counts of cities in separate dataframe
top_5_states = men_df.loc[(men_df['StateAbbr'] == 'CA') | (men_df['StateAbbr'] == 'TX') | (men_df['StateAbbr'] == 'FL')
| (men_df['StateAbbr'] == 'IL') | (men_df['StateAbbr'] == 'MI')]

# Isolating all other 45 states in separate dataframe
other_states_df = men_df.loc[(men_df['StateAbbr'] != 'CA') & (men_df['StateAbbr'] != 'TX') & (men_df['StateAbbr'] != 'FL')
& (men_df['StateAbbr'] != 'IL') & (men_df['StateAbbr'] != 'MI')]

In [None]:
# Creating separate data for the histogram counts
x = other_states_df['StateAbbr']
y = top_5_states['StateAbbr']

# Creating graph object
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc='count', x=x, name='State Participants'))
fig.add_trace(go.Histogram(histfunc='count', x=y, name='Top 5 States represented', marker_color='#330C73'))
fig.update_layout(title_text='Cities Count of Survey', xaxis_title_text='State Abbreviation', yaxis_title_text='Count')
fig.show()

In [None]:

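# shapely Points take (x, y), i.e. (longitude, latitude)
geometry = [Point(xy) for xy in zip(longs, lats)]
gdf = GeoDataFrame(prevent_df, geometry=geometry)

# Note: gpd.datasets was deprecated in GeoPandas 0.14 and removed in 1.0; on newer versions, load the Natural Earth file from a local path instead
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
usa = world.loc[world['name'] == 'United States of America']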

fig = px.scatter_geo(men_df, lat=men_df['lats'], lon=men_df['longs'], width=800, height=400, color=men_df['top5state'])
fig.update_layout(title = 'PLACES survey locations', geo_scope='usa', showlegend=False)
fig.show()

In [None]:
# Cross tabulation of the Data by State and the Population Count of each survey location
new_df = pd.crosstab(index=[men_df['StateAbbr'], pd.cut(men_df['PopulationCount'], [0, 50000, 100000, 250000, 500000, 1000000])], columns=men_df['PopulationCount'], margins=True, margins_name='Counts')

In [None]:
# Looking up a state's survey locations by population bracket, using the state abbreviation
new_df.loc['TX'].sort_values('PopulationCount')

In [None]:
new_df.loc['CA'].sort_values('PopulationCount')

## Working Hypothesis
- Direction for analysis: Geospatial -> why are certain states receiving more attention than others?
- Can we compare the female and male participants of this study -> are more women taking the CDC preventative measures? Fewer? The same?

In [None]:
agg_men = men_df.groupby('StateAbbr')
agg_women = women_df.groupby('StateAbbr')

In [None]:
state_counts = men_df.groupby('StateAbbr').count().Year

In [None]:
state_counts.describe()

# Hypothesis Testing

In [None]:
# Plot data for men, then women to see if t-test is appropriate for testing hypothesis
fig = px.histogram(prevent_df, 'DataValue', color='MeasureId', marginal='box', barmode='overlay', width=1000, height=500)
fig.update_layout(title='Overlaid Histogram of Prevention Rates by Sex', xaxis_title_text = 'Prevention Rates (%)')
fig.show()

The histogram above is more informative than expected. It shows that the data are close enough to normally distributed for a Student's t-test of the null hypothesis that the male and female subgroups have identical means; here, the means are the average rates at which each group meets the CDC's preventative measures guidelines. The histogram also suggests that male participants generally meet the guidelines more often than female participants.

COREW - Signifies females  
COREM - Signifies males
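
A visual check of normality can be backed up with a formal test. The sketch below uses the Shapiro-Wilk test from `scipy.stats` on synthetic data; the `rates` array is generated here purely for illustration and does not reflect the CDC values.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in for one group's DataValue column (not real data)
rng = np.random.default_rng(0)
rates = rng.normal(loc=30.0, scale=5.0, size=200)

# Shapiro-Wilk tests the null hypothesis that the sample comes from a normal distribution
stat, p_value = shapiro(rates)
print(f'W={stat:.3f}, p={p_value:.3f}')

# A small p-value would be evidence against normality; a large one is not proof of normality
looks_non_normal = p_value < 0.05
```

With large samples the t-test is fairly robust to mild departures from normality, so this check is a diagnostic rather than a gate.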

In [None]:
# Grouped data by gender, then state
# Taking the arithmetic mean of the DataValue column
# Looking at DataValue for men and women by state

# Selecting DataValue before averaging so non-numeric columns are never passed to mean()
mean_for_men = agg_men['DataValue'].mean()
mean_for_women = agg_women['DataValue'].mean()

In [None]:
# Importing statistics functions from scipy for the hypothesis testing
from scipy.stats import ttest_ind
from scipy.special import logsumexp

In [None]:
# Creating list objects out of series objects
men_list = list(mean_for_men)
women_list = list(mean_for_women)

In [None]:
# Creating function that iterates over each item of a list and applies the log-sum-exp to that item
# Note: logsumexp of a single value x is log(exp(x)) = x, so each item comes back unchanged
# Function returns the newly generated list
def logsums(means):
    # Iterating over the values directly works for both lists and label-indexed Series
    return [logsumexp(x) for x in means]

In [None]:
logs_men = logsums(mean_for_men)
logs_women = logsums(mean_for_women)
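
It is worth checking what the log-sum-exp step does to the state means. The sketch below uses three made-up values and wraps each one in a one-element list before calling `logsumexp`; mathematically that is the same as applying it to a single value.

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical state-level means (not taken from the dataset)
values = [70.5, 68.2, 72.9]

# Applied to one value at a time, logsumexp computes log(exp(x)), which is just x
per_item = [float(logsumexp([v])) for v in values]
print(per_item)

# Applied to the whole list, it computes log(exp(70.5) + exp(68.2) + exp(72.9)),
# a single number slightly above the maximum
combined = float(logsumexp(values))
print(combined)
```

Under this reading, the t-test further down is effectively run on the untransformed state means.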

In [None]:
# Performing Student's t-test on the aggregated female and male groups' prevention rates
t_stat, p = ttest_ind(logs_men, logs_women)
print(f't={t_stat}, p={p}')

The p-value for this Student's t-test is below 0.05. Therefore, we reject the null hypothesis that the average rates at which men and women satisfy the CDC's preventative measures guidelines are identical.
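
The decision rule for a two-sample t-test can be sketched as follows. The two groups are synthetic, the 0.05 threshold is the conventional choice rather than something fixed by the data, and `equal_var=False` (Welch's t-test) is an alternative to the default Student's version that does not assume equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic stand-ins for the per-state means of two groups (not real data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=32.0, scale=5.0, size=50)
group_b = rng.normal(loc=30.0, scale=5.0, size=50)

# Welch's t-test: like Student's t-test, but without assuming equal variances
t_stat, p = ttest_ind(group_a, group_b, equal_var=False)

# Reject the null hypothesis of equal means only when p falls below alpha
alpha = 0.05
decision = 'reject' if p < alpha else 'fail to reject'
print(f't={t_stat:.3f}, p={p:.3f}: {decision} the null hypothesis')
```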