# 4 Numerical EDA

<div>

<div>

In this chapter, you'll work with a dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) consisting of votes cast by members of the US House of Representatives. Your goal is to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues. Note that this dataset has already been preprocessed to deal with missing values, so that you can focus on understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them yourself, including on this very same dataset!

Before thinking about which supervised learning models you can apply to this dataset, however, you need to perform exploratory data analysis (EDA) to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of [Statistical Thinking in Python (Part 1)](https://www.datacamp.com/courses/statistical-thinking-in-python-part-1).

Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called `df`. Use pandas' `.head()`, `.info()`, and `.describe()` methods in the IPython Shell to explore the DataFrame, and select the statement below that is **not** true.

</div>

</div>
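As a minimal sketch of what this numerical exploration looks like, the snippet below runs the same three methods on a tiny made-up DataFrame. The name `toy` and its values are assumptions for illustration only, not rows from the real dataset; in the exercise you would call the methods on the pre-loaded `df`.

```python
import pandas as pd

# Hypothetical stand-in for the pre-loaded `df` (made-up values, not real votes).
toy = pd.DataFrame({
    "party": ["republican", "republican", "democrat", "democrat", "democrat"],
    "education": [1, 1, 0, 0, 0],
    "satellite": [0, 0, 1, 1, 1],
    "missile": [0, 0, 1, 0, 1],
})

# First rows: a quick look at column names and typical values.
print(toy.head())

# Column dtypes and non-null counts (info() prints directly).
toy.info()

# Summary statistics for the numeric columns.
print(toy.describe())

# Dimensions, and the feature columns (everything except the target).
n_rows, n_cols = toy.shape
feature_cols = [c for c in toy.columns if c != "party"]
print(n_rows, n_cols, feature_cols)
```

Separating the target column from the feature columns, as in the last lines, is a handy way to count predictor variables without confusing them with the total number of columns.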

<div class="exercise--instructions-title">

##### Possible Answers

</div>

<div data-onboarding="instructions" class="exercise--typography">

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_0" class="dc-input-radio" data-cy="mce-option"><input id="inp_0" data-cy="multiple-choice-input-0" type="radio" class="dc-input-radio__input" value="1"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">The DataFrame has a total of `435` rows and `17` columns.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_1" class="dc-input-radio" data-cy="mce-option"><input id="inp_1" data-cy="multiple-choice-input-1" type="radio" class="dc-input-radio__input" value="2"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">Except for `'party'`, all of the columns are of type `int64`.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_2" class="dc-input-radio" data-cy="mce-option"><input id="inp_2" data-cy="multiple-choice-input-2" type="radio" class="dc-input-radio__input" value="3"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">The first two rows of the DataFrame consist of votes made by Republicans and the next three rows consist of votes made by Democrats.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_3" class="dc-input-radio" data-cy="mce-option"><input id="inp_3" data-cy="multiple-choice-input-3" type="radio" class="dc-input-radio__input" value="4"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">There are 17 _predictor variables_, or _features_, in this DataFrame.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_4" class="dc-input-radio" data-cy="mce-option"><input id="inp_4" data-cy="multiple-choice-input-4" type="radio" class="dc-input-radio__input" value="5"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">The target variable in this DataFrame is `'party'`.</div>

    </div>

    </label></div>

    </div>

</div>

# 5 Visual EDA

<div>

<div>

The numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the `scatter_matrix()` function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as [Seaborn's `countplot`](http://seaborn.pydata.org/generated/seaborn.countplot.html).

Given on the right is a `countplot` of the `'education'` bill, generated from the following code:

    plt.figure()
    sns.countplot(x='education', hue='party', data=df, palette='RdBu')
    plt.xticks([0,1], ['No', 'Yes'])
    plt.show()

In `sns.countplot()`, we specify the x-axis data to be `'education'`, and `hue` to be `'party'`. Recall that `'party'` is also our target variable. The resulting plot therefore shows the difference in voting behavior between the two parties for the `'education'` bill, with each party colored differently. We manually set the color palette to `'RdBu'`, as the Republican party has traditionally been associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly _against_ this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously, and certainly not when dealing with hundreds of samples!
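The per-party counts behind a plot like this can also be tabulated directly. The sketch below uses `pd.crosstab` on a small made-up DataFrame (the name `toy` and its values are hypothetical, not real voting records) to show the tallies that each bar would represent.

```python
import pandas as pd

# Hypothetical toy data: one row per member, 0 = voted No, 1 = voted Yes.
toy = pd.DataFrame({
    "party": ["democrat", "democrat", "democrat", "republican", "republican"],
    "education": [0, 0, 1, 1, 1],
})

# Rows: vote on the bill; columns: party; cells: number of members.
counts = pd.crosstab(toy["education"], toy["party"])
print(counts)
```

Each cell of the table corresponds to the height of one bar in the count plot, which makes it easy to check a visual impression against exact numbers.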

In the IPython Shell, explore the voting behavior further by generating countplots for the `'satellite'` and `'missile'` bills, and answer the following question: Of these two bills, which did Democrats vote resoundingly in _favor_ of, compared to Republicans? Be sure to begin your plotting statements for each figure with `plt.figure()` so that a new figure is set up. Otherwise, your plots will be overlaid on the same figure.

</div>

</div>


<div class="listview__content">

<div>

<div class="exercise--instructions">

<div class="exercise--instructions-title">

##### Possible Answers

</div>

<div data-onboarding="instructions" class="exercise--typography">

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_0" class="dc-input-radio" data-cy="mce-option"><input id="inp_0" data-cy="multiple-choice-input-0" type="radio" class="dc-input-radio__input" value="1"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">`'satellite'`.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_1" class="dc-input-radio" data-cy="mce-option"><input id="inp_1" data-cy="multiple-choice-input-1" type="radio" class="dc-input-radio__input" value="2"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">`'missile'`.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_2" class="dc-input-radio" data-cy="mce-option"><input id="inp_2" data-cy="multiple-choice-input-2" type="radio" class="dc-input-radio__input" value="3"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">Both `'satellite'` and `'missile'`.</div>

    </div>

    </label></div>

    </div>

*   <div class="dc-edge-to-edge multiple-choice__options">

    <div class="dc-edge-to-edge__item"><label for="inp_3" class="dc-input-radio" data-cy="mce-option"><input id="inp_3" data-cy="multiple-choice-input-3" type="radio" class="dc-input-radio__input" value="4"><span class="dc-input-radio__indicator"></span>

    <div>

    <div class="dc-input-radio__text">Neither `'satellite'` nor `'missile'`.</div>

    </div>

    </label></div>

    </div>


</div>

</div>

</div>

</div>







In [6]:

    # Import plotting modules
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Import pandas with the alias 'pd'
    import pandas as pd

    # Read the data from 'df2.csv', resolved relative to the
    # current working directory of the Python process
    df = pd.read_csv("df2.csv")

    # Count plot of votes on the 'satellite' bill, split by party
    plt.figure()
    sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')
    plt.xticks([0,1], ['No', 'Yes'])
    plt.show()

    # Count plot of votes on the 'missile' bill, split by party
    plt.figure()
    sns.countplot(x='missile', hue='party', data=df, palette='RdBu')
    plt.xticks([0,1], ['No', 'Yes'])
    plt.show()

ValueError: Could not interpret input 'satellite'
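seaborn raises `Could not interpret input 'satellite'` when the DataFrame passed as `data` has no column with that name. One possible cause, which is only an assumption here since the contents of `df2.csv` are not shown, is that the file has no header row, so pandas took the first data row as the column names. The sketch below reproduces that situation with a small in-memory CSV (made-up values and a hypothetical three-column subset of names) and shows how passing `header=None` and `names=` restores the expected columns.

```python
import io
import pandas as pd

# Made-up, header-less CSV text standing in for the real file.
raw = "republican,0,1\ndemocrat,1,1\ndemocrat,1,0\n"

# Read naively: the first data row is mistaken for the header.
naive = pd.read_csv(io.StringIO(raw))
print(list(naive.columns))

# Hypothetical column names for this three-column toy file.
col_names = ["party", "satellite", "missile"]

# header=None tells pandas there is no header row; names= supplies one.
fixed = pd.read_csv(io.StringIO(raw), header=None, names=col_names)
print(list(fixed.columns))
print("satellite" in fixed.columns)
```

Printing `df.columns` before plotting is a quick way to confirm which names are actually available to `sns.countplot()`.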

In [35]:

    df = pd.read_csv('./data/house-votes.csv')

FileNotFoundError: File b'./data/house-votes.csv' does not exist

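In [20]:

    import csv
    import os

    # __file__ is not defined in a notebook or IPython session (it raises a
    # NameError), so resolve the path from the current working directory.
    my_path = os.getcwd()
    path = os.path.join(my_path, "../data/votes.csv")
    with open(path, newline="") as f:
        test = list(csv.reader(f))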