# Capstone 1 Exploratory Data Analysis and Inferential Statistics Report

This project is a pilot study for a network of alcoholism treatment centers to see if the severity of winter weather, as measured by the Accumulated Winter Season Severity Index (AWSSI), could be used to project patient demand. The study is based on the hypothesis that severe winter weather may exacerbate symptoms of alcoholism, leading more people to seek treatment, and thereby increase demand at treatment centers.

### Feature data
The AWSSI data -- the feature in this analysis -- was measured at the Blue Hill, Massachusetts, weather reporting station from 1992 to 2014. The AWSSI data included the following elements:

* AWSSI, the overall measure of winter severity
* Snow score, based on frequency and amount of snowfall 
* Temperature score, based on frequency and degree of cold temperatures
* Length of winter in days
* Start date for winter, derived not from the calendar designation but from the first day that certain winter weather conditions were recorded
* End date for winter, derived not from the calendar designation but from the last day that certain winter weather conditions were recorded

The final two features were used to engineer two new features, which indicate how early and how late, in days from the winter solstice (December 21), the start and end of winter occurred. On both scales, larger positive values indicate a more extended winter: one that started earlier or one that ended later.
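
The two engineered features can be sketched as signed day offsets from the solstice. This is a minimal sketch, not the report's actual code: it assumes the reference date is December 21 of the calendar year in which the winter season begins, and the function names and sample dates are hypothetical.

```python
from datetime import date

def solstice_for_season(season_start_year: int) -> date:
    """December 21 of the year in which the winter season begins (assumed)."""
    return date(season_start_year, 12, 21)

def days_early(winter_start: date, season_start_year: int) -> int:
    """Positive when winter conditions began before the solstice."""
    return (solstice_for_season(season_start_year) - winter_start).days

def days_late(winter_end: date, season_start_year: int) -> int:
    """Positive when winter conditions ended after the solstice."""
    return (winter_end - solstice_for_season(season_start_year)).days

# Hypothetical dates for illustration only; not taken from the Blue Hill record.
start_offset = days_early(date(2002, 11, 16), 2002)
end_offset = days_late(date(2003, 4, 8), 2002)
```

Subtracting in opposite directions for the two scales keeps the sign convention consistent: a larger value always means a winter that extends further from the solstice.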

### Target data
Massachusetts alcoholism admissions case data from 1992 to 2014 were collected from the federal government’s “Treatment Episode Data Set: Admissions” (TEDS-A). Case data included demographic characteristics for each admission. Demographic groups were included in the analysis. Some were derived by combining smaller subgroups, as described in the Data Wrangling report. Below are the groups that were analyzed: 

* Total number of admissions
* Gender: male, female
* Race: white, black
* Education: without a high school degree, high school graduate, attended college
* Marital status: never married, married now or in the past
* Employment: working, not working

### Shape of target data
I made a histogram of the overall target variable, the total admissions by year for alcoholism treatment. With only 23 data points, it's hard to generalize from the histogram, but there were 10 years when Massachusetts saw between 30,000 and 34,999 patient admissions for alcoholism. The distribution's tail extended further on the low end than on the high end.
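
The binning behind such a histogram can be sketched with the standard library alone. The bin width of 5,000 is inferred from the 30,000 to 34,999 bin mentioned above, and the yearly totals here are placeholders rather than the TEDS-A figures; `bin_counts` is a hypothetical helper, not the plotting call used in the study.

```python
from collections import Counter

BIN_WIDTH = 5000  # assumed from the 30,000 to 34,999 bin described above

def bin_counts(values, width=BIN_WIDTH):
    """Count values per bin; each bin is keyed by its lower edge."""
    return Counter((v // width) * width for v in values)

# Placeholder yearly totals for illustration only; not the TEDS-A figures.
yearly_admissions = [18200, 24750, 27900, 30100, 31800, 33400, 34999, 36250, 38900]

counts = bin_counts(yearly_admissions)

# Text rendering of the histogram, one row per bin.
for lower in sorted(counts):
    upper = lower + BIN_WIDTH - 1
    print(f"{lower:>6,} to {upper:>6,}: {'#' * counts[lower]}")
```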

### Scatter plots
The three features relating to winter in terms of time (length, start date, and end date) were plotted on a single scatterplot. With the sparse data, no clear patterns were evident, other than a slightly negative relationship between number of days (winter length, earlier start, later end) and the number of people admitted for alcoholism. Since length increases along with greater values on the other two scales, its points sat above those of the other two on the plot. This initial scatterplot did not support the premise of the pilot study, that more people suffer from alcoholism as measures of winter severity increase.

Ten additional scatterplots buttressed this observation. The three time-related winter severity measures above were plotted individually against total alcoholism admissions to see if trends were more evident when plotted separately, but the findings were the same. The AWSSI and its two component scales (snow and temperature) were plotted against total alcoholism patient admissions, and total admissions were plotted against demographic breakdowns with the biggest Ns: male, female, black, and white. The relationships between features and targets were consistent: a weak negative relationship, i.e., fewer people were admitted for alcoholism in years when winter was more severe.

### Stacked bar charts overlaid with line charts

When stacked bar charts showing the components of the demographic classifications (i.e., female stacked on male, black stacked on white) were overlaid with line charts of the winter severity index and its components, there was additional evidence of a weak negative relationship between alcoholism and winter severity. One dip in admissions between 2001 and 2006 tracked moderately with a spike in winter severity for the years 2003 to 2005. A dip in winter severity for the years 2006 to 2008 coincided with an increase in admissions for those years.

These stacked demographic bar charts overlaid with winter severity line charts also showed the following:

* Changes in the demographic composition of alcoholism patients across years did not demonstrate patterns tied to winter severity. Instead, changes in the demographic breakdowns over the years were likely the result of other, independent trends affecting the overall population. For example, the percentage of people admitted for alcoholism who were unemployed grew markedly after 2007, most likely because of rising unemployment in the overall population during the Great Recession.

* The components of the winter severity index – the snow score and the temperature score – tracked closely with the overall index. But the length of the winter did not track with any of these other winter severity measures and did not change much over time.

### Summary statistics

I used pandas to create a dataframe of summary statistics for all variables. Examining the dataframe containing all the data for this pilot study, together with the summary statistics dataframe, revealed that the winter severity index is the sum of the temperature score and the snow score. Additionally, the snow score has much more variance than the temperature score.
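
The two observations above can be checked directly in pandas. The sketch below uses placeholder values and assumed column names (`temp_score`, `snow_score`, `awssi`); it is not the study's dataframe.

```python
import pandas as pd

# Placeholder values for illustration only; column names are assumed.
winters = pd.DataFrame({
    "temp_score": [310, 295, 402, 350],
    "snow_score": [120, 480, 655, 210],
    "awssi":      [430, 775, 1057, 560],
})

# count, mean, std, min, quartiles, and max for every column
summary = winters.describe()

# Does the index equal the sum of its two component scores in every row?
component_sum = winters["temp_score"] + winters["snow_score"]
index_is_sum = bool((winters["awssi"] == component_sum).all())

# Sample variance of each component, to compare their spread
variances = winters[["temp_score", "snow_score"]].var()
```

With real-valued scores, an exact `==` comparison may be too strict, and a tolerance-based check such as `numpy.isclose` would be a safer choice.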

The summary statistics of the target variables showed that some subgroups contained small numbers of alcoholism patients. Based on means, the categories of veterans, part-time workers, and people separated from their spouses each accounted for between 1.5% and 7% of the total.

### Correlations

I used pandas to compute correlations between all variables. Target variables, particularly those with high Ns, were highly correlated with each other, as would be expected. However, these correlations are not relevant to this pilot study. They only demonstrate, for example, that when the total number of alcoholism patients increases, the number of both male and female patients also increases, because these male and female counts are merely breakdowns of the total number. 
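
A minimal sketch of this step, assuming placeholder values and hypothetical column names: `DataFrame.corr()` returns every pairwise Pearson correlation, and selecting feature rows against target columns isolates the block that matters for the study.

```python
import pandas as pd

# Placeholder values for illustration only; column names are assumed.
df = pd.DataFrame({
    "awssi":  [430, 775, 1057, 560, 890],
    "total":  [34000, 31000, 27000, 33000, 29500],
    "male":   [24000, 22000, 19000, 23500, 21000],
    "female": [10000, 9000, 8000, 9500, 8500],
})

features = ["awssi"]
targets = ["total", "male", "female"]

# Pearson correlation (the pandas default) between every pair of columns
corr = df.corr()

# Keep only feature rows and target columns; target-to-target cells are dropped
feature_target = corr.loc[features, targets]
```

In this toy data the target-to-target correlations are high simply because `male` and `female` sum to `total`, which is the same artifact described above.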

### Heatmaps

I used seaborn and matplotlib to make a heatmap of the correlations between features and target variables. I first used a diverging color palette to show which correlations were positive and which were negative, including all the finer subgroup classifications. The heatmap showed:

1.	Overall there are systematic negative correlations between the winter severity index (and its components) and the number of people admitted for alcoholism treatment.
2.	There are only two positive correlations, and they are so small as to be meaningless (0.04 was the largest).
3.	Additionally, in terms of correlation size and direction, subgroups with the smallest counts were outliers from the other correlations. 

### Consolidating target variables

Findings 2 and 3, as well as the small counts for some subgroups noted above, led me to combine smaller subgroups into fewer, larger subgroups and rerun the correlations. I made the following changes to the target variables:

* I deleted the veteran and civilian variables, since they had missing data and most admissions were not veterans.
* For education, five subgroups were reduced to three: those without high school degrees, those with high school degrees, and those with any amount of college education.
* For employment status, four groups were combined into two: those working (part-time or full-time), and those not working (unemployed or not in the workforce).
* Four marital status groups were combined into two: those who were never married, and those who were currently or formerly married.
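
The consolidation described in the list above amounts to summing subgroup columns row by row and dropping the variables that were excluded. Here is a minimal sketch for the employment groups, with placeholder counts and assumed column names:

```python
import pandas as pd

# Placeholder counts for illustration only; column names are assumed.
admissions = pd.DataFrame({
    "full_time":          [5000, 4800],
    "part_time":          [1500, 1400],
    "unemployed":         [9000, 9500],
    "not_in_labor_force": [7000, 7200],
    "veteran":            [1200, None],  # has missing data
})

# Combine four employment subgroups into two by summing across columns
consolidated = pd.DataFrame({
    "working": admissions[["full_time", "part_time"]].sum(axis=1),
    "not_working": admissions[["unemployed", "not_in_labor_force"]].sum(axis=1),
})

# Drop a variable with missing data instead of combining it
admissions = admissions.drop(columns=["veteran"])
```

The education and marital status groups would follow the same pattern with their own column lists.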

### Increased correlations

I used pandas to calculate summary statistics for these new, consolidated target variables, and to calculate correlations between these target variables and the features. Except for the end of winter scale, the size of the negative correlations between the features and the target variables increased when the demographic subgroups were consolidated. The average increases of the negative correlations are listed below:

* AWSSI: 0.05
* Temp Score: 0.04
* Snow Score: 0.05
* Length: 0.04
* Start: 0.04

The heatmap of the correlations showed mostly negative correlations in the range of -0.30 to -0.40. Most of the outlying correlations involved the three time-related features: winter length, start, and end.

### Feature correlations

Finally, I wondered whether all of the features were useful, and whether one could be chosen as the best predictor of the target variables. As noted above, the overall winter severity index is the sum of the snow score and the temperature score. I ran correlations between all six features and created a heatmap. Snow score and temperature score were highly correlated with the overall index (0.97 and 0.82, respectively). Snow score had a correlation of 0.68 with temperature score, while the time-related features had the lowest correlations with the index and its components.

### Revised hypothesis

Based on my exploratory data analysis and review of inferential statistics, I have changed the hypothesis for this pilot study: *MILDER* winter weather may be predictive of demand for treatment of alcoholism.
