# Module 8 - Final Project 


**In this final project, you will demonstrate what you have learned in this class by applying a thorough data analysis to a data set of your choice.** 

As discussed on the discussion boards in the last two weeks, your data set should have sufficiently rich features and be of moderate size so that you can apply the methods you have learned to derive some **insights** from the data. The ultimate goal of data analysis is to learn from the data so that we can form useful and actionable knowledge. 
 

Use this notebook for the analysis of your final project. **Your deliverables for the final project are:** 

 - **This notebook**: contains the **full** data science project life cycle as you have learned in the Intro class. 
   - Load your data, clean your data, include all the carpentry you need to put it into some shape for your later analysis. 
   - EDA: exploratory data analysis: this is where you do the univariate and multivariate analyses; visualize your data to understand the shape and associations in it. Plot densities, histograms, scatter plots, etc. Look for correlations, patterns, clusters, etc. 
   - Modeling: use any suitable technique to see if you can model the effects of the variables on each other. You can try dimensionality reduction, regression, hypothesis testing, etc. to see how variables affect each other. 
   - **Insights:** the point of data analysis is to derive **insights** from your data. What do you see? What did you learn from the data? Did it answer your initial question or confirm your hypothesis? 
   - **Presenting results**: Communicate your results effectively using visualization principles you have learned. 
   
   
   
 - A **PDF** file that contains a few slides summarizing your results: similar to the data vis course, create a PDF document of at most 3-4 slides (**no more**) to present your results. 
 
 
 
 - Upload your data set to this folder (under the exercises folder). 
 
 ---
 
 
 **You should have ample explanation and comments for all your data analysis in this notebook.** Think of it as a LAB notebook: imagine that you are creating a lab notebook for others to follow and learn from your analysis. You should have enough details so that somebody can recreate your analysis by following your descriptions and explanations. 
 
 Have a markdown cell before each code cell explaining **what** you are doing and **why** you are doing it. Comment the code inside each code cell. Have a markdown cell **after** every code cell explaining what you found out by running it. 
 
 ---
 
 

## 1. Loading, Cleaning, Displaying Data 

Explain your data set here; what is it about? What are the variables? What do you want to do with it? How are you going to clean it? etc. 


In [2]:
## load your data, display with head(), clean, etc. 
# Read data from files
mortality_data <- read.csv("mortality_data.csv",header=TRUE)
incidence_data <- read.csv("incidence_data.csv",header=TRUE)
income_data <- read.csv("income_data.csv",header=TRUE)
insurance_data <- read.csv("insurance_data.csv",header=TRUE)
poverty_data <- read.csv("poverty_data.csv",header=TRUE)


In [3]:
# Show first few rows of each df
head(mortality_data)
head(incidence_data)
head(income_data)
head(insurance_data)
head(poverty_data)


Unnamed: 0_level_0,county,fips,met_objective_of_45_5_1,age_adjusted_death_rate,lower_95_confidence_interval_for_death_rate,upper_95_confidence_interval_for_death_rate,average_deaths_per_year,recent_trend_2,recent_5_year_trend_2_in_death_rates,lower_95_confidence_interval_for_trend,upper_95_confidence_interval_for_trend
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,United States,0,No,46.0,45.9,46.1,157376,falling,-2.4,-2.6,-2.2
2,"Perry County, Kentucky",21193,No,125.6,108.9,144.2,43,stable,-0.6,-2.7,1.6
3,"Powell County, Kentucky",21197,No,125.3,100.2,155.1,18,stable,1.7,0,3.4
4,"North Slope Borough, Alaska",2185,No,124.9,73.0,194.7,5,**,**,**,**
5,"Owsley County, Kentucky",21189,No,118.5,83.1,165.5,8,stable,2.2,-0.4,4.8
6,"Union County, Florida",12125,No,113.5,89.9,141.4,19,falling,-2.2,-4.3,0


Unnamed: 0_level_0,county,fips,age_adjusted_incidence_rate_e_cases_per_100_000,lower_95_confidence_interval,upper_95_confidence_interval,average_annual_count,recent_trend,recent_5_year_trend_in_incidence_rates,lower_95_confidence_interval_2,upper_95_confidence_interval_2
Unnamed: 0_level_1,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,"US (SEER+NPCR)(1,10)",0,62.4,62.3,62.6,214614,falling,-2.5,-3.0,-2.0
2,"Autauga County, Alabama(6,10)",1001,74.9,65.1,85.7,43,stable,0.5,-14.9,18.6
3,"Baldwin County, Alabama(6,10)",1003,66.9,62.4,71.7,170,stable,3.0,-10.2,18.3
4,"Barbour County, Alabama(6,10)",1005,74.6,61.8,89.4,25,stable,-6.4,-18.3,7.3
5,"Bibb County, Alabama(6,10)",1007,86.4,71.0,104.2,23,stable,-4.5,-31.4,32.9
6,"Blount County, Alabama(6,10)",1009,69.7,61.2,79.0,51,stable,-13.6,-27.8,3.4


Unnamed: 0_level_0,StateFIPS,CountyFIPS,Median_Income,Median_Income_White,Median_Income_Black,Median_Income_NativeAmerican,Median_Income_Asian,Median_Income_Hispanic
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,2,13,61518,72639,31250.0,54750,62679.0,51250.0
2,2,16,84306,97321,93750.0,48750,81250.0,77500.0
3,2,20,78326,87235,50535.0,53935,63757.0,53926.0
4,2,50,51012,92647,73661.0,41594,110625.0,160114.0
5,2,60,79750,88000,,63333,,25625.0
6,2,68,81544,81029,,47500,,


Unnamed: 0_level_0,StateFIPS,CountyFIPS,M_With_Ins,M_Without_Ins,F_With_Ins,F_Without_Ins,All_With_Ins,All_Without_Ins
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,2,13,876,1317,566,540,1442,1857
2,2,16,2470,769,1707,564,4177,1333
3,2,20,120747,23245,122426,21393,243173,44638
4,2,50,6396,2708,6627,1774,13023,4482
5,2,60,419,124,349,67,768,191
6,2,68,822,253,674,250,1496,503


Unnamed: 0_level_0,State,StateFIPS,CountyFIPS,AreaName,All_Poverty,M_Poverty,F_Poverty
Unnamed: 0_level_1,<chr>,<int>,<int>,<chr>,<int>,<int>,<int>
1,AK,2,13,"Aleutians East Borough, Alaska",553,334,219
2,AK,2,16,"Aleutians West Census Area, Alaska",499,273,226
3,AK,2,20,"Anchorage Municipality, Alaska",23914,10698,13216
4,AK,2,50,"Bethel Census Area, Alaska",4364,2199,2165
5,AK,2,60,"Bristol Bay Borough, Alaska",69,33,36
6,AK,2,68,"Denali Borough, Alaska",254,139,115
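
The income, insurance, and poverty tables above identify a county with separate `StateFIPS` and `CountyFIPS` columns, while the mortality and incidence tables carry a single `fips` code. Below is a minimal sketch of one way to build a shared key and join two of the tables. It uses only the first three rows shown in the `head()` output; `make_fips`, `income_head`, and `poverty_head` are hypothetical names introduced for illustration, and the key formula is an assumption that should be checked against the full files.

```r
# Rows copied from the head() output above (first three rows of each table).
income_head <- data.frame(
  StateFIPS     = c(2L, 2L, 2L),
  CountyFIPS    = c(13L, 16L, 20L),
  Median_Income = c(61518L, 84306L, 78326L)
)
poverty_head <- data.frame(
  StateFIPS   = c(2L, 2L, 2L),
  CountyFIPS  = c(13L, 16L, 20L),
  All_Poverty = c(553L, 499L, 23914L)
)

# Assumption: the combined county code is state code * 1000 + county code,
# e.g. state 2 and county 185 give 2185 (North Slope Borough, Alaska).
make_fips <- function(state, county) {
  as.integer(state * 1000L + county)
}

income_head$fips  <- make_fips(income_head$StateFIPS, income_head$CountyFIPS)
poverty_head$fips <- make_fips(poverty_head$StateFIPS, poverty_head$CountyFIPS)

# Inner join on the shared key, keeping only the columns of interest.
merged <- merge(
  income_head[, c("fips", "Median_Income")],
  poverty_head[, c("fips", "All_Poverty")],
  by = "fips"
)
print(merged)
```

The same key could then be used to join against the `fips` column of the mortality and incidence tables. Comparing row counts before and after each merge is a quick way to see how many counties fail to match.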


In [4]:
# Show structure of each df
str(mortality_data)
str(incidence_data)
str(income_data)
str(insurance_data)
str(poverty_data)


'data.frame':	3141 obs. of  11 variables:
 $ county                                     : chr  "United States" "Perry County, Kentucky" "Powell County, Kentucky" "North Slope Borough, Alaska" ...
 $ fips                                       : int  0 21193 21197 2185 21189 12125 21147 21131 21159 21165 ...
 $ met_objective_of_45_5_1                    : chr  "No" "No" "No" "No" ...
 $ age_adjusted_death_rate                    : chr  "46" "125.6" "125.3" "124.9" ...
 $ lower_95_confidence_interval_for_death_rate: chr  "45.9" "108.9" "100.2" "73" ...
 $ upper_95_confidence_interval_for_death_rate: chr  "46.1" "144.2" "155.1" "194.7" ...
 $ average_deaths_per_year                    : chr  "157,376" "43" "18" "5" ...
 $ recent_trend_2                             : chr  "falling" "stable" "stable" "**" ...
 $ recent_5_year_trend_2_in_death_rates       : chr  "-2.4" "-0.6" "1.7" "**" ...
 $ lower_95_confidence_interval_for_trend     : chr  "-2.6" "-2.7" "0" "**" ...
 $ upper_95_confidence_
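
The `str()` output above shows that the rate and count columns were read as `chr`: some values contain thousands separators (`"157,376"`) and suppressed values appear as `"**"`. The incidence table also appends footnote markers such as `(6,10)` to county names. Here is a hedged sketch of one possible cleanup, run on a few values copied from the output; `to_numeric` and `mortality_head` are hypothetical names, and the sketch assumes `"**"` is the only non-numeric marker, which may not hold for the full files.

```r
# Values copied from the head()/str() output above.
mortality_head <- data.frame(
  age_adjusted_death_rate              = c("46", "125.6", "125.3", "124.9"),
  average_deaths_per_year              = c("157,376", "43", "18", "5"),
  recent_5_year_trend_2_in_death_rates = c("-2.4", "-0.6", "1.7", "**"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop thousands separators, map "**" to NA, convert.
to_numeric <- function(x) {
  x <- gsub(",", "", x, fixed = TRUE)
  x[x %in% "**"] <- NA
  as.numeric(x)
}

num_cols <- c("age_adjusted_death_rate",
              "average_deaths_per_year",
              "recent_5_year_trend_2_in_death_rates")
mortality_head[num_cols] <- lapply(mortality_head[num_cols], to_numeric)

str(mortality_head)
colSums(is.na(mortality_head))  # missing values per column after conversion

# Strip a trailing footnote marker such as "(6,10)" from county names.
county_raw   <- c("US (SEER+NPCR)(1,10)", "Autauga County, Alabama(6,10)")
county_clean <- sub("\\([0-9,]+\\)$", "", county_raw)
print(county_clean)
```

If `as.numeric` emits a coercion warning on the full data, that signals another non-numeric marker that needs its own rule.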

## 2. Exploratory Data Analysis 

Use multiple code and markdown cells here to explore the data: univariate and multivariate analyses, histograms, correlations, scatter plots, missing values, etc. 


In [2]:
# code and comments 

# have a markdown cell after every major step to explain what you did and what you learned. 
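
As one illustration of a univariate and a bivariate step, the sketch below uses the six rows shown in the `head()` output of the insurance and income tables. Raw counts scale with county population, so it first converts them to a share. It assumes `All_With_Ins + All_Without_Ins` is the right denominator, and `eda_head` and `uninsured_share` are hypothetical names. Six rows are far too few to draw conclusions; the point is only the shape of the code.

```r
# Rows copied from the head() output of insurance_data and income_data.
eda_head <- data.frame(
  CountyFIPS      = c(13L, 16L, 20L, 50L, 60L, 68L),
  All_With_Ins    = c(1442L, 4177L, 243173L, 13023L, 768L, 1496L),
  All_Without_Ins = c(1857L, 1333L, 44638L, 4482L, 191L, 503L),
  Median_Income   = c(61518L, 84306L, 78326L, 51012L, 79750L, 81544L)
)

# Assumption: insured + uninsured counts make up the relevant population.
eda_head$uninsured_share <- with(
  eda_head, All_Without_Ins / (All_With_Ins + All_Without_Ins)
)

# Univariate: numeric summary and histogram of the derived share.
summary(eda_head$uninsured_share)
hist(eda_head$uninsured_share,
     main = "Uninsured share (illustrative rows only)",
     xlab = "Share of population without insurance")

# Bivariate: scatter plot and Pearson correlation with median income.
plot(eda_head$Median_Income, eda_head$uninsured_share,
     xlab = "Median income", ylab = "Uninsured share")
r <- cor(eda_head$Median_Income, eda_head$uninsured_share,
         use = "complete.obs")
print(r)
```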


## 3. 4. 5. ... The Rest..

Similar to above, have **sections** for every major step in your analysis. PCA, FA, clustering, running your models, testing hypotheses, explaining what you have learned, presenting results with visualizations, and the conclusion should each have their own section. 
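
For a PCA section, a minimal sketch might look like the following. It uses the built-in `USArrests` data set as a stand-in, since the project's merged table is not shown here; the variable names `pca` and `var_explained` are illustrative.

```r
# USArrests ships with base R and stands in for the project data here.
# Standardize the variables so that no single scale dominates.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Loadings: how each original variable contributes to each component.
print(pca$rotation)

# Scores on the first two components, and a biplot to inspect them.
head(pca$x[, 1:2])
biplot(pca, scale = 0)
```

In the notebook, the markdown cell that follows would state how many components are worth keeping and what the loadings suggest about the variables.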
