# Week 02 Pre-Class Assignment: Exploratory Data Analysis

### <p style="text-align: right;"> &#9989; Kyle Taft</p>


![CA](https://miro.medium.com/max/671/1*f82SOgbdQOmY5DHmF0kdgw.png)



## Goals for today's pre-class assignment
In this Pre-Class Assignment you are going to complete the data science portion of the book's End-to-End project (chapter 2). The main learning goals are:
* understand how to build and manage a real ML project, much like your project will be,
* practice using some of the data science tools (e.g., Pandas),
* learn some new tools that will help you in your project.


**This assignment is due by 11:59 p.m. the day before class,** and should be uploaded into the appropriate "Pre-Class Assignments" submission folder on D2L.  Submission instructions can be found at the end of the notebook.

___
<h2><center><font color='green'>Machine Learning Housing Corp.</font></center></h2>

The author of your book thoughtfully provided the entire code base that he used to build Chapter 2. You should be able to find all of the code for your [textbook at GitHub](https://github.com/ageron/handson-ml3). Be sure you work with this document _and_ the code from the textbook so that you don't end up writing a ton of code yourself, which is not the point. (If you _want_ to write your own code, that is totally fine too! We are just using the code from the textbook to save time.)

**Note:** It will be very useful to have your textbook handy.

Follow these steps:

1. Download the [Chapter 2 notebook](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb) from GitHub
2. Run the notebook up to Part 3 (Prepare the Data for Machine Learning Algorithms) inclusive and **make sure you understand what every code cell is doing**. 
3. Answer the questions below.
4. Turn in _this_ notebook with your answers in the usual way (no need to resubmit the notebook from the textbook).

What you will do is read through the textbook's notebook and answer questions about it. Some of the answers are in the textbook itself, some in the notebook.

## Part 1. Pandas and Data

Once you are certain the textbook's [notebook](https://github.com/ageron/handson-ml3/blob/main/02_end_to_end_machine_learning_project.ipynb) is working (run all of the cells - it needs to go out to the web to get information), go through the first portion and answer these questions:

1. Describe in your own words what the goals of this project are.  
2. Read through the code. See if there are interesting ideas/tricks there that you did not know about. What did you find? 
3. What form is the data in, and are there any problems with it? For example, are all of the potential features integers or floats, and what `pandas` function can help you answer this question?
4. What does `.value_counts()` do? 
5. What does `.describe()` do? 
6. What do `.iloc` and `.loc` do?

<font size=6 color="#009600">&#9998;</font> *Put your answers here!* 

1. Given a district's population, median income, and other information, the goal is to predict the median housing price in that district.
2. The most interesting piece of code is the download of the data from the website. It shows off a lot of cool tricks for handling paths and downloading data. Also, I had forgotten that you can use `.value_counts()` to get the number of times each value appears in a column.
3. The data comes from a .csv file. `housing.info()` shows that every column was read in as a float except `ocean_proximity`, which is text (dtype `object`), and that `total_bedrooms` has some missing values. If we want to use `ocean_proximity` as a feature, we will have to convert it to numbers.
4. `.value_counts()` returns how many times each unique value appears in a column, sorted from most to least frequent.
5. .describe() returns a summary of the numerical data. It gives the count, mean, standard deviation, min, max, and quartiles.
6. .iloc indexes with integer locations. .loc indexes using the labels of the rows and columns.
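
The pandas calls discussed above can be tried on a small table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not the textbook's housing data.

```python
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame.
toy = pd.DataFrame(
    {
        "median_income": [8.3, 7.2, 5.6, 3.8],
        "total_bedrooms": [129.0, None, 190.0, 235.0],
        "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "<1H OCEAN"],
    },
    index=["a", "b", "c", "d"],
)

# info() prints each column's dtype and non-null count,
# which reveals text columns and missing values.
toy.info()

# value_counts() counts how often each unique value occurs.
category_counts = toy["ocean_proximity"].value_counts()

# describe() summarizes the numerical columns only (by default).
summary = toy.describe()

# iloc selects by integer position; the end of a slice is excluded.
by_position = toy.iloc[1:3]

# loc selects by label; the end of a slice is included.
by_label = toy.loc["b":"c"]
single_value = toy.loc["c", "median_income"]
```

Note that both slices return the same two rows here, even though one is written with positions and the other with labels.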

## Part 2. Histogram

Let's move below the first 3x3 array of histograms. Answer these questions in detail.

1. In the first set of 3x3 histograms, do you see anything there that seems odd/interesting/useful/bothersome to you? How would you deal with that problem? 
2. What does the author choose to do in terms of splitting the data into testing and training? Does the author use cross validation?
3. What is `StratifiedShuffleSplit` and why would you use it? What problem does it solve for you?
4. How is `ocean_proximity` handled?

<font size=6 color="#009600">&#9998;</font> *Put your answers here!* 

1. Median income is on a scale of roughly 0 to 15, so it cannot be in plain dollars. I would look at the source of the data to see what this means (the textbook explains that the values are scaled and capped, and represent roughly tens of thousands of dollars). Also, median house age and median house value are capped at certain values. I would try to find out why they are capped and whether I can get the uncapped data. If that is not possible, I would consider removing the capped districts from the dataset.
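2. The author adds an income-category column and splits the data into an 80% training set and a 20% test set so that each income category keeps the same proportion in both sets (a stratified split). `StratifiedShuffleSplit` belongs to scikit-learn's family of cross-validation splitters, but a single held-out train/test split is not cross-validation in itself; up to this point in the notebook, no cross-validation is performed.
3. `StratifiedShuffleSplit` generates one or more random train/test splits while keeping the proportion of each stratum (here, each income category) the same in both sets as in the full dataset. This guards against sampling bias, where a purely random split could leave the test set unrepresentative of an important feature.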
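4. At this point, `ocean_proximity` is not handled yet. Later on, the author first tries sklearn's `OrdinalEncoder`, then switches to `OneHotEncoder` because ordinal codes would imply an ordering between the categories that does not exist.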
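
As a rough sketch of the stratified split described above, the snippet below builds income categories with `pd.cut` and passes them to `train_test_split` through its `stratify` argument. The `toy` data is randomly generated for illustration, and the bin edges are an assumption chosen to resemble the income categories discussed in the chapter.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, randomly generated incomes (not the real housing data).
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=1000)}
)

# Bucket the continuous income into 5 categories (assumed bin edges).
toy["income_cat"] = pd.cut(
    toy["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

# stratify= keeps each category's share the same in both sets.
train_set, test_set = train_test_split(
    toy, test_size=0.2, stratify=toy["income_cat"], random_state=42
)

# Compare category proportions in the full data and in the test set.
overall = toy["income_cat"].value_counts(normalize=True).sort_index()
in_test = test_set["income_cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()
print(f"largest proportion gap: {max_gap:.4f}")
```

With a plain random split (no `stratify`), the gap between the two sets of proportions would typically be larger, which is the sampling bias that stratification is meant to reduce.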

## Part 3. Visualization

Ok, let's move into the visualization part. The author may use plotting tools you would not normally use, so let's see what he did. (For example, how was the 3x3 histogram made? Seaborn? Or?)

1. What tool is the author using to make these plots? Straight matplotlib, or something else? 
2. Go through the code below very carefully. What are all of these options? 

    `housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)`


3. The author ends up with a very nice plot that uses a real map. How did he do that? What tools did he need to do that?


<font size=6 color="#009600">&#9998;</font> *Put your answers here!* 

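1. The author uses pandas' built-in plotting methods (`housing.hist()` and `housing.plot()`), which are wrappers around matplotlib.
2. `kind` is the type of plot. `x` and `y` are the columns plotted on the x and y axes (longitude and latitude). `alpha` is the transparency of the points, which makes densely populated areas stand out where points overlap. `s` is the size of the circles (population divided by 100). `label` is the legend label for the points (population). `figsize` is the size of the figure. `c` is the column that sets the color of the points (median house value). `cmap` is the color map. `colorbar` displays the color bar. `sharex` controls whether the x axis is shared between subplots; setting it to `False` here works around a display bug.
3. He downloads an image of a map of California, reads it in with matplotlib, and uses `plt.imshow()` to draw it on the same axes as the scatter plot, with the `extent` argument aligning the image with the longitude and latitude ranges.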
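
To see what each option does, the same call can be run on made-up data. This is a sketch only: the `toy` DataFrame is randomly generated (so the points will not look like California), and `cmap="jet"` is passed as a string rather than through `plt.get_cmap`.

```python
import numpy as np
import pandas as pd

# Hypothetical random districts (not the real housing data).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "longitude": rng.uniform(-124.0, -114.0, n),
        "latitude": rng.uniform(32.5, 42.0, n),
        "population": rng.integers(100, 5000, n),
        "median_house_value": rng.uniform(50_000, 500_000, n),
    }
)

ax = toy.plot(
    kind="scatter",              # scatter plot
    x="longitude",               # column on the x axis
    y="latitude",                # column on the y axis
    alpha=0.4,                   # transparency of each point
    s=toy["population"] / 100,   # marker size from population
    label="population",          # legend label
    c="median_house_value",      # column that sets the color
    cmap="jet",                  # color map
    colorbar=True,               # draw the color scale
    figsize=(10, 7),             # figure size in inches
    sharex=False,
)
```

Changing one option at a time (for example, removing `alpha` or `s`) is a quick way to see its effect on the figure.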

## Part 4. Correlations

Next, the author spends a lot of time looking for correlations. Go through this section very carefully!

1. What is the author trying to achieve by looking at correlations? Give a very detailed answer.
2. What does `corr_matrix["median_house_value"].sort_values(ascending=False)` do?
3. What is `scatter_matrix`?
4. What do the scatter plots tell you?
5. Move into the ML portion of the notebook. Go to `sklearn`'s webpages and learn what this does and why you would use it:
   
       from sklearn.impute import SimpleImputer
       imputer = SimpleImputer(strategy="median")





<font size=6 color="#009600">&#9998;</font> *Put your answers here!* 

1. The goal is to find obvious linear relationships between the features and the median house value, which helps identify features that are likely to be useful for our model. The correlation coefficient only measures linear relationships, so this step can completely miss non-linear ones.
2. This returns Pearson's r between each numerical feature and the median house value, sorted in descending order.
3. `scatter_matrix` plots each numerical column against every other numerical column, with a histogram of each column on the diagonal by default. It is useful for finding relationships between features.
4. We can see that there is a linear relationship between median house value and median income. There are also artefacts in the form of horizontal lines in the plots, most visibly at the price cap of $500,000.
5. This allows us to fill in missing values with, in this case, the median of the feature. This is useful because we want to preserve the data when we can instead of just removing it. This is a simple method of doing this compared to methods such as KNN imputation.
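
The correlation ranking and the median imputation described above can be sketched on a tiny table. All values below are invented for illustration; only `SimpleImputer(strategy="median")` and the `corr_matrix[...]` expression come from the assignment itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value.
toy = pd.DataFrame(
    {
        "median_income": [1.0, 2.0, 3.0, 4.0, 5.0],
        "total_bedrooms": [100.0, np.nan, 300.0, 400.0, 1000.0],
        "median_house_value": [110.0, 190.0, 310.0, 405.0, 480.0],
    }
)

# Pearson's r of every column with the target, largest first.
corr_matrix = toy.corr()
ranked = corr_matrix["median_house_value"].sort_values(ascending=False)
print(ranked)

# Pearson's r only measures linear dependence: y = x**2 is fully
# determined by x, yet r is zero on this symmetric range.
x = pd.Series([-2.0, -1.0, 0.0, 1.0, 2.0])
r_quadratic = x.corr(x**2)

# Replace each missing value with the median of its column.
imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy)
print(imputer.statistics_)  # the per-column medians learned by fit
```

The imputer learns the medians during `fit`, so the same medians can later be applied to new data (such as the test set) with `transform`.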

Hopefully you learned a lot of new techniques for handling real data. Think about how these will help you in your project. 

Be sure to read Chapter 2 very carefully before the ICA. 

---
## Assignment wrap-up

Please fill out the form that appears when you run the code below.  **You must completely fill this out in order to receive credit for the assignment!**

    from IPython.display import HTML
    HTML(
    """
    <iframe
        src="https://forms.office.com/r/QyrbnptkyA"
        width="800px"
        height="600px"
        frameborder="0"
        marginheight="0"
        marginwidth="0">
        Loading...
    </iframe>
    """
    )

&#169; Copyright 2023, Department of Computational Mathematics, Science and Engineering at Michigan State University.