# Kaggle

## What is Kaggle?

## Machine Learning

## Accessing the `House Prices` Dataset

`crash-course/Kaggle/DATA/house-prices/train.csv`



# Exploratory Data Analysis

## The Data Science Workflow

## EDA

```{r}
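library(tidyverse)  # read_csv(), dplyr, tidyr, ggplot2
library(broom)      # augment()
library(ggfortify)  # autoplot() for prcomp objects
training_set <- read_csv("DATA/house-prices/train.csv")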
```


```{r}
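# Proportion of missing values per column, ten worst first
training_set %>% summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Column", values_to = "PropMissing") %>% arrange(desc(PropMissing)) %>% head(10)
# Drop rows containing NAs, ignoring NAs in the six columns excluded below
eda_set <- training_set %>% drop_na(-c(PoolQC, MiscFeature, Alley, Fence, FireplaceQu, LotFrontage))
nrow(eda_set) / nrow(training_set)  # fraction of rows retained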
```
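
The missing-value filter above can be sketched on a toy table. The data frame `toy_houses` and its columns are made up for illustration, and the sketch assumes only that `tidyr` is installed.

```{r}
library(tidyr)

# Hypothetical toy data: `Pool` is mostly missing, `Area` has one gap
toy_houses <- data.frame(
  Area  = c(1200, NA, 1500, 1800),
  Pool  = c(NA, NA, "Gd", NA),
  Price = c(200, 150, 260, 310)
)

# Proportion of missing values per column
toy_missing <- colMeans(is.na(toy_houses))
print(toy_missing)

# Drop rows with NAs in any column except `Pool`
toy_complete <- drop_na(toy_houses, -Pool)
print(nrow(toy_complete) / nrow(toy_houses))
```

Excluding a sparse column from `drop_na()` keeps rows that would otherwise be discarded only because that one column is empty.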

```{r}
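# Scatterplot matrix: plot every column of `df` against every other column
pairs_plot <- function(df) {
  df_x <- rename_all(df, function(str) paste(str, "x", sep = "."))
  df_y <- rename_all(df, function(str) paste(str, "y", sep = "."))
  bind_cols(df_x, df_y) %>%
    gather(xVar, xVal, ends_with(".x")) %>%
    gather(yVar, yVal, ends_with(".y")) %>%
    ggplot(aes(xVal, yVal, col = xVar)) +
    geom_point() +
    facet_grid(yVar ~ xVar, scales = "free") +
    labs(x = "X Variable", y = "Y Variable")
}
# Plot a few numeric columns only: passing every column would mix types
# and produce thousands of panels
eda_set %>% select(SalePrice, LotArea, GrLivArea) %>% pairs_plot()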
```

```{r}
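# Biplot of the first two principal components with feature loadings
pca_plot <- function(df, scale = FALSE) {
  autoplot(prcomp(df, scale. = scale), data = df,
           loadings = TRUE, loadings.colour = "blue",
           loadings.label = TRUE, loadings.label.size = 3)
}
# PCA needs complete numeric data; SalePrice is the target, so exclude it
pca_set <- eda_set %>% select(-SalePrice) %>% select_if(is.numeric) %>% drop_na()
pca_plot(pca_set)
eda_pca <- prcomp(pca_set)
summary(eda_pca)
# Rank features by column mean (`center`) and keep the top 20.
# This is a rough heuristic, not a PCA-based measure of importance.
# magrittr::extract is spelled out because tidyr also exports extract().
important_features <- eda_pca %>% magrittr::extract(c("sdev", "center")) %>%
  as.data.frame() %>% mutate(Feature = rownames(.)) %>%
  arrange(desc(center)) %>% slice(1:20) %>% pull(Feature)
print(important_features)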
```
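
Because the PCA above runs on unscaled columns, features measured in large units can dominate the first component. The sketch below illustrates this on the built-in `mtcars` data, which stands in for the housing data purely as an assumption for illustration; `var_explained` is a hypothetical helper, not part of any package.

```{r}
# Built-in `mtcars` stands in for the housing data (illustration only)
toy_x <- mtcars[, c("mpg", "disp", "hp", "wt")]

toy_raw    <- prcomp(toy_x, scale. = FALSE)
toy_scaled <- prcomp(toy_x, scale. = TRUE)

# Proportion of variance explained by each component: sdev^2 / sum(sdev^2)
var_explained <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

print(round(var_explained(toy_raw), 3))
print(round(var_explained(toy_scaled), 3))

# Without scaling, the first component mostly tracks the high-variance columns
print(round(toy_raw$rotation[, "PC1"], 3))
```

Passing `scale = TRUE` to `pca_plot()` is one way to try the same comparison on the housing data.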



# Feature Selection

# A First Approach to Machine Learning: Linear Regression

```{r}
# One-predictor model: sale price as a linear function of lot area
simple_model <- lm(data = training_set, formula = SalePrice ~ LotArea)
simple_model %>% augment %>% select(`SalePrice (Actual)` = SalePrice, `SalePrice (Predicted)` = .fitted, LotArea) %>%
  gather(Value, SalePrice, - LotArea) %>%
  ggplot(aes(LotArea, SalePrice, col = Value, linetype = Value)) + geom_point()

# Two-predictor model: add above-ground living area
initial_model <- lm(data = training_set, formula = SalePrice ~ LotArea + GrLivArea)
initial_model %>% augment %>% select(`SalePrice (Actual)` = SalePrice, `SalePrice (Predicted)` = .fitted, LotArea, GrLivArea) %>%
  gather(Value, SalePrice, - c(LotArea, GrLivArea)) %>%
  ggplot(aes(LotArea, SalePrice, col = Value, linetype = Value)) + geom_point()
# Residuals against lot area; a good fit scatters evenly around zero
initial_model %>% augment %>%
  ggplot(aes(LotArea, .resid)) + geom_point() + 
  geom_hline(yintercept = 0, col = "black", linetype = "dashed") +
  labs(x = "Lot Area (Square Ft.)", y = "Residual", title = "Residual Plot of Initial Linear Model")

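# Model on the features picked during EDA; backticks protect column names
# that are not syntactic R names (for example, ones starting with a digit)
model <- lm(data = training_set,
            formula = as.formula(paste("SalePrice ~ ", paste(important_features %>% paste0("`", ., "`"), collapse = " + "))))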
model %>% augment %>% 
  ggplot(aes(SalePrice, .resid)) + geom_point() + 
  geom_hline(yintercept = 0, col = "black", linetype = "dashed") +
  labs(x = "Sales Price", y = "Residual", title = "Residual Plot of 'Important Features' Linear Model") 
```
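
The residual plots above are drawn on the same rows used to fit each model. One common complement, not part of the analysis above, is to hold out some rows and measure the error on them. The sketch below does this on synthetic data; every `toy_` name and all values are made up for illustration.

```{r}
set.seed(1)

# Synthetic stand-in for the housing data (values are made up)
toy_n <- 200
toy_df <- data.frame(LotArea = runif(toy_n, 2000, 20000))
toy_df$SalePrice <- 50000 + 10 * toy_df$LotArea + rnorm(toy_n, sd = 20000)

# Hold out 25% of the rows for evaluation
toy_test_idx <- sample(toy_n, size = toy_n / 4)
toy_train <- toy_df[-toy_test_idx, ]
toy_test  <- toy_df[toy_test_idx, ]

toy_fit <- lm(SalePrice ~ LotArea, data = toy_train)

# Root-mean-squared error on rows the model has not seen
toy_pred <- predict(toy_fit, newdata = toy_test)
toy_rmse <- sqrt(mean((toy_test$SalePrice - toy_pred)^2))
print(toy_rmse)
```

The same split-fit-predict pattern should carry over to `training_set` by swapping in the real data frame and formula.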




# Extensions to Linear Regression

# Conclusion

This ends our textbook-style primer on Kaggle, exploratory data analysis, and linear regression. While this was just an introduction, we hope that you can now see some of the workflow patterns associated with machine learning. Feel free to play around with the code above to get a better feel for how the choice of features affects the linear model. As always, please email [`contact@arun.run`](mailto:contact@arun.run) or [`prc@berkeley.edu`](mailto:prc@berkeley.edu) with any questions or concerns whatsoever. Happy machine learning!

## Sneak Peek at SUSA Kaggle Competition II

After Spring Break, we will be guiding you through a four-week collaborative Kaggle competition with your peers in Career Exploration! We want to give you the experience of working with real data, using real machine learning algorithms, in an educational setting. You will choose either Python or R, dive into reading kernels on the Kaggle website, use visualization and feature engineering to improve your score, and maybe even pick up a few advanced deep learning models along the way. If this sounds a bit intimidating right now, do not fret! Your SUSA Mentors will be there to guide you through the whole thing. So rest up during Spring Break, and come back ready to tackle your biggest data challenge yet!

# Additional Reading
* For more information on the Kaggle API, a command-line program used to download and manage Kaggle datasets, visit the [Kaggle API GitHub page](https://github.com/Kaggle/kaggle-api).
* For an interactive guide to learning R and Python, visit [DataCamp](https://www.datacamp.com/), a paid tutorial website for learning data computing.
