<h2 class="css-1sl3ts5">1. Introduction to pandas DataFrames</h2>
<div><h2 class="css-1sl3ts5">2. What's the point of pandas?</h2><p>pandas is a Python package for data manipulation. It can also be used for data visualization; we'll get to that in Chapter 4.

</p></div><div><h2 class="css-1sl3ts5">3. Course outline</h2><p>We'll start by talking about DataFrames, which form the core of pandas.

In Chapter 2, we'll discuss aggregating data to gather insights.

In Chapter 3, you'll learn all about slicing and indexing to subset DataFrames.

Finally, you'll visualize your data, deal with missing data, and read data into a DataFrame.

Let's dive in.

</p></div><div><h2 class="css-1sl3ts5">4. pandas is built on NumPy and Matplotlib</h2><p>pandas is built on top of two essential Python packages, NumPy and Matplotlib. NumPy provides multidimensional array objects for easy data manipulation, which pandas uses to store data, and Matplotlib has powerful data visualization capabilities that pandas takes advantage of.

</p></div><div><h2 class="css-1sl3ts5">5. pandas is popular</h2><p>pandas has millions of users, with PyPI recording about 14 million downloads in December 2019. This suggests that almost the entire Python data science community uses it!

</p><ol class="css-lsly4i"><li><sup>1</sup> https://pypistats.org/packages/pandas</li></ol></div><div><h2 class="css-1sl3ts5">6. Rectangular data</h2><p>There are several ways to store data for analysis, but rectangular data, sometimes called "tabular data", is the most common form. In this example, with dogs, each observation, or each dog, is a row, and each variable, or each dog property, is a column. pandas is designed to work with rectangular data like this.

</p></div><div><h2 class="css-1sl3ts5">7. pandas DataFrames</h2><p>In pandas, rectangular data is represented as a DataFrame object. Every programming language used for data analysis has something similar. R also has data frames, while SQL has database tables. Every value within a column has the same data type, either text or numeric, but different columns can contain different data types.
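
As a rough sketch of what this means in code, the snippet below builds a small DataFrame and inspects the data type of each column. The three rows are taken from the homelessness output shown later in these notes; the variable names sample and numeric_flags are only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Three columns, each holding a single data type
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],   # text
    "individuals": [2570.0, 1434.0, 7259.0],     # floating-point numbers
    "state_pop": [4887681, 735139, 7158024],     # integers
})

# One dtype per column; different columns can have different dtypes
print(sample.dtypes)

# Check which columns are numeric
numeric_flags = {col: is_numeric_dtype(sample[col]) for col in sample.columns}
print(numeric_flags)
```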

</p></div><div><h2 class="css-1sl3ts5">8. Exploring a DataFrame: .head()</h2><p>When you first receive a new dataset, you want to quickly explore it and get a sense of its contents. pandas has several methods for this. 

The first is head, which returns the first few rows of the DataFrame. We only had seven rows to begin with, so it's not super exciting, but this becomes very useful if you have many rows.
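
A minimal sketch of head in use, assuming a seven-row DataFrame built by hand from the homelessness rows printed later in these notes (the names first_five and first_three are illustrative):

```python
import pandas as pd

# First seven rows of the homelessness data, as printed later in these notes
homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas",
              "California", "Colorado", "Connecticut"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0, 7607.0, 2280.0],
})

first_five = homelessness.head()     # default: the first 5 rows
first_three = homelessness.head(3)   # or ask for a specific number of rows

print(first_five)
print(first_three)
```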

</p></div><div><h2 class="css-1sl3ts5">9. Exploring a DataFrame: .info()</h2><p>The info method displays the names of columns, the data types they contain, and whether they have any missing values.
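
One detail worth knowing: info prints its report itself and returns None, so wrapping it in print() adds a stray "None" line to the output. A small sketch, using three rows of the homelessness data shown later in these notes:

```python
import pandas as pd

homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570.0, 1434.0, 7259.0],
    "state_pop": [4887681, 735139, 7158024],
})

# .info() writes the report to standard output and returns None
result = homelessness.info()
print(result is None)
```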

</p></div><div><h2 class="css-1sl3ts5">10. Exploring a DataFrame: .shape</h2><p>A DataFrame's shape attribute contains a tuple that holds the number of rows followed by the number of columns. Since this is an attribute instead of a method, you write it without parentheses.
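
Because shape is an ordinary tuple, it can be unpacked directly. A short sketch, again with three rows of the homelessness data from later in these notes (n_rows and n_cols are illustrative names):

```python
import pandas as pd

homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570.0, 1434.0, 7259.0],
    "state_pop": [4887681, 735139, 7158024],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
n_rows, n_cols = homelessness.shape
print(n_rows, n_cols)
```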

</p></div><div><h2 class="css-1sl3ts5">11. Exploring a DataFrame: .describe()</h2><p>The describe method computes some summary statistics for numerical columns, like mean and median. "count" is the number of non-missing values in each column. describe is good for a quick overview of numeric variables, but if you want more control, you'll see how to perform more specific calculations later in the course.
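
The result of describe is itself a DataFrame, so individual statistics can be looked up by label. In this sketch (five rows of the homelessness data from later in these notes; summary is an illustrative name), the median appears under the "50%" row label:

```python
import pandas as pd

homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# By default, only the numeric columns are summarized
summary = homelessness.describe()
print(summary)

# Look up single statistics by row and column label
median_individuals = summary.loc["50%", "individuals"]   # the median
count_individuals = summary.loc["count", "individuals"]  # non-missing values
print(median_individuals, count_individuals)
```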

</p></div><div><h2 class="css-1sl3ts5">12. Components of a DataFrame: .values</h2><p>DataFrames consist of three different components, accessible using attributes.

The values attribute, as you might expect, contains the data values in a 2-dimensional NumPy array.
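
A small sketch of what values returns, assuming three rows of the homelessness data from later in these notes. When columns mix text and numbers, the array falls back to the generic object dtype. The pandas documentation recommends the to_numpy() method over values; both are shown here.

```python
import numpy as np
import pandas as pd

homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570.0, 1434.0, 7259.0],
    "family_members": [864.0, 582.0, 2606.0],
})

# Mixed text and numbers: a 2-D array with dtype object
all_values = homelessness.values
print(type(all_values), all_values.shape, all_values.dtype)

# Numeric columns only: a 2-D float array
numeric_values = homelessness[["individuals", "family_members"]].to_numpy()
print(numeric_values.dtype)
```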

</p></div><div><h2 class="css-1sl3ts5">13. Components of a DataFrame: .columns and .index</h2><p>The other two components of a DataFrame are labels for columns and rows. The columns attribute contains column names, and the index attribute contains row numbers or row names. Be careful, since row labels are stored in .index, not in .rows.

Notice that these are Index objects, which we'll cover in Chapter 3. This allows for flexibility in labels. For example, the dogs data uses row numbers, but row names are also possible.
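
To illustrate the difference between row numbers and row names, here is a sketch that uses three rows of the homelessness data from later in these notes. The call to set_index and the name by_state are illustrative; indexes are covered properly in Chapter 3.

```python
import pandas as pd

homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570.0, 1434.0, 7259.0],
})

# Column labels and default row numbers are both Index objects
print(homelessness.columns)
print(homelessness.index)

# Row names instead of row numbers: use the state column as the index
by_state = homelessness.set_index("state")
print(by_state.index)
print(by_state.loc["Alaska", "individuals"])
```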

</p></div><div><h2 class="css-1sl3ts5">14. pandas Philosophy</h2><p>Python has a semi-official philosophy on how to write good code called The Zen of Python. One suggestion is that given a programming problem, there should only be one obvious solution. As you go through this course, bear in mind that pandas deliberately doesn't follow this philosophy. Instead, there are often multiple ways to solve a problem, leaving you to choose the best. 

In this respect, pandas is like a Swiss Army knife: it gives you a variety of tools, which makes it incredibly powerful but more difficult to learn. In this course, we aim for a more streamlined approach to pandas, covering only the most important ways of doing things.

</p></div>

<div class="listview__content"><div class="exercise--assignment exercise--typography"><h1 class="exercise--title">1- Inspecting a DataFrame</h1><div class="">
<p>When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.</p>
<ul>
<li><code>.head()</code> returns the first few rows (the “head” of the DataFrame).</li>
<li><code>.info()</code> shows information on each of the columns, such as the data type and number of non-missing values.</li>
<li><code>.shape</code> returns the number of rows and columns of the DataFrame.</li>
<li><code>.describe()</code> calculates a few summary statistics for each numeric column.</li>
</ul>
<p><code>homelessness</code> is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The <code>individuals</code> column is the number of homeless individuals not part of a family with children. The <code>family_members</code> column is the number of homeless individuals part of a family with children. The <code>state_pop</code> column is the state's total population.</p>
</div></div></div>

<ul>
<li>Print the head of the <code>homelessness</code> DataFrame.</li>
<li>Print information about the column types and missing values in <code>homelessness</code>.</li>
<li>Print the number of rows and columns in <code>homelessness</code>.</li>
<li>Print some summary statistics that describe the <code>homelessness</code> DataFrame.</li>
</ul>

<pre><code>import pandas as pd

# Load the homelessness data
homelessness = pd.read_csv('datasets/homelessness.csv')

# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
# (.info() prints its report itself and returns None, so no print() is needed)
homelessness.info()

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())
</code></pre>
<pre><code>   Unnamed: 0              region       state  individuals  family_members  \
0           0  East South Central     Alabama       2570.0           864.0
1           1             Pacific      Alaska       1434.0           582.0
2           2            Mountain     Arizona       7259.0          2606.0
3           3  West South Central    Arkansas       2280.0           432.0
4           4             Pacific  California     109008.0         20964.0

   state_pop
0    4887681
1     735139
2    7158024
3    3009733
4   39461588
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
Unnamed: 0        51 non-null int64
region            51 non-null object
state             51 non-null object
individuals       51 non-null float64
family_members    51 non-null float64
state_pop         51 non-null int64
dtypes: float64(2), int64(2), object(2)
memory usage: 2.5+ KB
(51, 6)
       Unnamed: 0    individuals  family_m
</code></pre>

<h1 class="exercise--title">2- Parts of a DataFrame</h1><div class="">
<p>To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:</p>
<ul>
<li><code>.values</code>: A two-dimensional NumPy array of values.</li>
<li><code>.columns</code>: An index of columns: the column names.</li>
<li><code>.index</code>: An index for the rows: either row numbers or row names.</li>
</ul>
<p>You can usually think of indexes as a list of strings or numbers, though the pandas <code>Index</code> data type allows for more sophisticated options. (These will be covered later in the course.)</p>
<p><code>homelessness</code> is available.</p></div>

<h1 class="exercise--title">Instructions</h1>
<ul>
<li>Print a 2D NumPy array of the values in <code>homelessness</code>.</li>
<li>Print the column names of <code>homelessness</code>.</li>
<li>Print the index of <code>homelessness</code>.</li>
</ul>

<pre><code># Print the values of homelessness
print(homelessness.values)

# Print the column index of homelessness
print(homelessness.columns)

# Print the row index of homelessness
print(homelessness.index)
</code></pre>

<pre><code>[[0 'East South Central' 'Alabama' 2570.0 864.0 4887681]
 [1 'Pacific' 'Alaska' 1434.0 582.0 735139]
 [2 'Mountain' 'Arizona' 7259.0 2606.0 7158024]
 [3 'West South Central' 'Arkansas' 2280.0 432.0 3009733]
 [4 'Pacific' 'California' 109008.0 20964.0 39461588]
 [5 'Mountain' 'Colorado' 7607.0 3250.0 5691287]
 [6 'New England' 'Connecticut' 2280.0 1696.0 3571520]
 [7 'South Atlantic' 'Delaware' 708.0 374.0 965479]
 [8 'South Atlantic' 'District of Columbia' 3770.0 3134.0 701547]
 [9 'South Atlantic' 'Florida' 21443.0 9587.0 21244317]
 [10 'South Atlantic' 'Georgia' 6943.0 2556.0 10511131]
 [11 'Pacific' 'Hawaii' 4131.0 2399.0 1420593]
 [12 'Mountain' 'Idaho' 1297.0 715.0 1750536]
 [13 'East North Central' 'Illinois' 6752.0 3891.0 12723071]
 [14 'East North Central' 'Indiana' 3776.0 1482.0 6695497]
 [15 'West North Central' 'Iowa' 1711.0 1038.0 3148618]
 [16 'West North Central' 'Kansas' 1443.0 773.0 2911359]
 [17 'East South Central' 'Kentucky' 2735.0 953.0 4461153]
 [18 'West South Cen
</code></pre>

<div class="listview__content"><div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Sorting rows</h1><div class="">
<p>Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to <code>.sort_values()</code>.</p>
<p>In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.</p>
<table>
<thead>
<tr>
<th>Sort on …</th>
<th>Syntax</th>
</tr>
</thead>
<tbody>
<tr>
<td>one column</td>
<td><code>df.sort_values("breed")</code></td>
</tr>
<tr>
<td>multiple columns</td>
<td><code>df.sort_values(["breed", "weight_kg"])</code></td>
</tr>
</tbody>
</table>
<p>By combining <code>.sort_values()</code> with <code>.head()</code>, you can answer questions in the form, "What are the top cases where…?".</p>
<p><code>homelessness</code> is available and <code>pandas</code> is loaded as <code>pd</code>.</p></div></div></div>
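<p>The two forms in the table above can be sketched on <code>homelessness</code> itself. This is only an illustration: the DataFrame is rebuilt by hand from the first five rows printed earlier, and the names <code>by_ind</code> and <code>by_region_ind</code> are made up for the example.</p>

```python
import pandas as pd

# First five rows of homelessness, as printed by .head() earlier
homelessness = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
})

# One column: smallest to largest by default
by_ind = homelessness.sort_values("individuals")
print(by_ind.head(3))

# Multiple columns: ties on region are broken by individuals
by_region_ind = homelessness.sort_values(["region", "individuals"])
print(by_region_ind)
```

<p>Note that <code>.sort_values()</code> returns a new DataFrame and leaves the original row order untouched.</p>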

<h1 class="exercise--title">Instructions</h1>
<div class="listview__content">
<ul>
<li>Sort <code>homelessness</code> by the number of homeless individuals, from smallest to largest, and save this as <code>homelessness_ind</code>.</li>
<li>Print the head of the sorted DataFrame.</li>
</ul>
<ul>
<li>Sort <code>homelessness</code> by the number of homeless <code>family_members</code> in descending order, and save this as <code>homelessness_fam</code>.</li>
<li>Print the head of the sorted DataFrame.</li>
</ul>
<ul>
<li>Sort <code>homelessness</code> first by region (ascending), and then by number of family members (descending). Save this as <code>homelessness_reg_fam</code>.</li>
<li>Print the head of the sorted DataFrame.</li>
</ul>
</div>

<pre><code># Sort homelessness by individuals (ascending is the default)
homelessness_ind = homelessness.sort_values('individuals')

# Print the top few rows
print(homelessness_ind.head())
</code></pre>

<pre><code>    Unnamed: 0              region         state  individuals  family_members  \
50          50            Mountain       Wyoming        434.0           205.0
34          34  West North Central  North Dakota        467.0            75.0
7            7      South Atlantic      Delaware        708.0           374.0
39          39         New England  Rhode Island        747.0           354.0
45          45         New England       Vermont        780.0           511.0

    state_pop
50     577601
34     758080
7      965479
39    1058287
45     624358
</code></pre>
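<p>The cell above covers only the first instruction. A possible sketch of the other two is given below, assuming the <code>ascending</code> argument of <code>.sort_values()</code>, which accepts either a single boolean or a list with one boolean per sort column. The DataFrame here is rebuilt by hand from the first five rows printed earlier, so the result is not the full 51-row answer.</p>

```python
import pandas as pd

# First five rows of homelessness, as printed by .head() earlier
homelessness = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# Sort by family_members, largest first
homelessness_fam = homelessness.sort_values("family_members", ascending=False)
print(homelessness_fam.head())

# Sort by region (ascending), then by family_members (descending)
homelessness_reg_fam = homelessness.sort_values(
    ["region", "family_members"], ascending=[True, False]
)
print(homelessness_reg_fam.head())
```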


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Subsetting columns</h1><div class="">
<p>When working with data, you may not need all of the variables in your dataset. Square brackets (<code>[]</code>) can be used to select only the columns that matter to you in an order that makes sense to you.
To select only <code>"col_a"</code> of the DataFrame <code>df</code>, use</p>
<pre><code>df["col_a"]
</code></pre>
<p>To select <code>"col_a"</code> and <code>"col_b"</code> of <code>df</code>, use</p>
<pre><code>df[["col_a", "col_b"]]
</code></pre>
<p><code>homelessness</code> is available and <code>pandas</code> is loaded as <code>pd</code>.</p></div></div>

<h1>Instructions</h1>
<div class="listview__content">
<ul>
<li>Create a DataFrame called <code>individuals</code> that contains only the <code>individuals</code> column of <code>homelessness</code>.</li>
<li>Print the head of the result.</li>
</ul>
<ul>
<li>Create a DataFrame called <code>state_fam</code> that contains only the <code>state</code> and <code>family_members</code> columns of <code>homelessness</code>, in that order.</li>
<li>Print the head of the result.</li>
</ul>
<ul>
<li>Create a DataFrame called <code>ind_state</code> that contains the <code>individuals</code> and <code>state</code> columns of <code>homelessness</code>, in that order.</li>
<li>Print the head of the result.</li>
</ul>
</div>

<pre><code># Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
print(individuals.head())
</code></pre>

<pre><code>0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64
</code></pre>
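<p>The remaining two instructions use a list of column names inside the square brackets. A minimal sketch follows, with <code>homelessness</code> rebuilt by hand from the first five rows printed earlier. It also shows that single brackets give back a pandas Series (hence the <code>Name: individuals, dtype: float64</code> footer above), while double brackets give back a DataFrame; <code>individuals_df</code> is an illustrative name.</p>

```python
import pandas as pd

# First five rows of homelessness, as printed by .head() earlier
homelessness = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# Single brackets: one column, returned as a Series
individuals = homelessness["individuals"]

# Double brackets: a list of columns, returned as a DataFrame
individuals_df = homelessness[["individuals"]]

# Select the state and family_members columns, in that order
state_fam = homelessness[["state", "family_members"]]
print(state_fam.head())

# Select the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]
print(ind_state.head())
```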


<div class="exercise--assignment exercise--typography"><h1 class="exercise--title">Subsetting rows</h1><div class="">
<p>A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as <em>filtering rows</em> or <em>selecting rows</em>.</p>
<p>There are many ways to subset a DataFrame; perhaps the most common is to use relational operators to return <code>True</code> or <code>False</code> for each row, then pass that inside square brackets.</p>
<pre><code>dogs[dogs["height_cm"] &gt; 60]
dogs[dogs["color"] == "tan"]
</code></pre>
<p>You can filter for multiple conditions at once by using the "bitwise and" operator, <code>&amp;</code>.</p>
<pre><code>dogs[(dogs["height_cm"] &gt; 60) &amp; (dogs["color"] == "tan")]
</code></pre>
<p><code>homelessness</code> is available and <code>pandas</code> is loaded as <code>pd</code>.</p></div></div>
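<p>Applied to <code>homelessness</code>, the same pattern might look like the sketch below. The thresholds and the names <code>is_large</code>, <code>ind_gt_5k</code>, and <code>pac_small_fam</code> are chosen only for illustration, and the DataFrame is rebuilt by hand from the first five rows printed earlier.</p>

```python
import pandas as pd

# First five rows of homelessness, as printed by .head() earlier
homelessness = pd.DataFrame({
    "region": ["East South Central", "Pacific", "Mountain",
               "West South Central", "Pacific"],
    "state": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "individuals": [2570.0, 1434.0, 7259.0, 2280.0, 109008.0],
    "family_members": [864.0, 582.0, 2606.0, 432.0, 20964.0],
})

# A relational operator gives one True/False value per row
is_large = homelessness["individuals"] > 5000
print(is_large)

# Passing that boolean Series inside square brackets keeps the True rows
ind_gt_5k = homelessness[is_large]
print(ind_gt_5k)

# Combine conditions with &; each condition needs its own parentheses
pac_small_fam = homelessness[
    (homelessness["region"] == "Pacific") & (homelessness["family_members"] < 1000)
]
print(pac_small_fam)
```

<p>The parentheses matter because <code>&amp;</code> binds more tightly than the comparison operators in Python.</p>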