# Lesson 2: Importing Data Into Python

In this lesson, we'll learn to pull data into our Python environment and run some of the first diagnostics we'll need to understand our data.

Why don't you kick us off by importing our two most important data analytic tools into our Jupyter notebook: Pandas and NumPy?

## Exercise 1
Import Pandas and NumPy into your Jupyter Notebook and assign the standard aliases to them. 

Now let's go straight into importing data into our Python environment! Check out DC's Open Data Portal at opendata.dc.gov.

From here, we'll import data into our Python environment in two different ways. First, let's try downloading the data from the website and then pulling it into this notebook. 

## Exercise 2
Find the City Service Requests for 2016 in DC's Open Data Portal and download the spreadsheet (.csv file) to your computer.

This is a data set containing requests for service through DC's 311 system. People can enter these requests by calling 311, texting DC-311 (32-311), using the 311 app, or visiting the city's online 311 portal. 

Now let's try pulling the data into our Python environment using Pandas. 

## Exercise 3

Pull the .csv file into your Python environment using Pandas and assign it to an object called 'df'.

What happened? You may have received an error like this: "OSError: File b'City_Service_Requests_in_2016.csv' does not exist". (Newer versions of Pandas raise a FileNotFoundError with a "No such file or directory" message instead.)

This means that the .csv file is not in your notebook's current working directory, which by default is the directory the notebook itself is in. To find out which directory that is, import the os package and get the current working directory.

## Exercise 4
Import the os package and get the current working directory.

Ok, now that we know what the current working directory is, we must either change the working directory or move the file into the current working directory. I'll let you decide which you want to do, but this is how you change the working directory.

In [None]:
os.chdir('C:\\Users\\peter.casey\\Downloads') 
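## The backslash is an escape character in Python strings, so each backslash in a Windows path has to be doubled.
## (Prefixing the string with r to make it a raw string, or using forward slashes, also works.)
## If you have trouble getting a file from a directory you know it's in, a mistyped path is a common cause.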
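
As a small sketch of the path options on Windows, the cell below writes the same path three ways and checks that they agree. The folder name is a made-up placeholder, not a path from this lesson.

```python
# Three ways to write one Windows path (the folder is a made-up placeholder).
escaped = 'C:\\Users\\your.name\\Downloads'   # every backslash doubled
raw = r'C:\Users\your.name\Downloads'         # raw string: backslashes are taken literally
forward = 'C:/Users/your.name/Downloads'      # forward slashes, which Python on Windows generally accepts

print(escaped == raw)                         # the two literals produce the same string
print(forward.replace('/', '\\') == raw)      # same path, different separator
```

Any of the three forms can be passed to os.chdir.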

Now you should be able to pull the data into your Jupyter Notebook. Try it again.

We can also pull the .csv file directly from the Open Data Portal by using the link.

In [None]:
## I'm commenting this out because it takes a while to pull the data into the environment because it's a large file
#df = pd.read_csv('https://opendata.arcgis.com/datasets/0e4b7d3a83b94a178b3d1f015db901ee_7.csv')
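
If the full download is too slow while you are still exploring, one option is to load only the first rows with the `nrows` argument of `read_csv`. The sketch below uses a tiny made-up CSV held in memory (the column names are illustrative only) so that it runs without any download.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for the large file.
csv_text = "REQUEST_ID,WARD\n1,2\n2,6\n3,1\n4,8\n"

# nrows limits how many data rows are loaded; the header line is not counted.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.shape)
```

The same argument can be used with a file path or a link in place of the in-memory text.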

Great! You have the data! 

One of the first things we'd like to know when we're dealing with a data set is its shape, that is, the number of rows and columns it has.

In [None]:
df.shape

This data has 302,925 rows and 30 columns. Generally speaking, rows are our "observations" or "samples", while columns are our "variables" or "features". 

Now try getting JUST the number of rows.

In [None]:
df.shape[0]

When you have an object like this with multiple elements, you can often access the elements by their position. In Python, we always start counting positions at 0, so the first element is always element 0.
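
Here is a minimal sketch of zero-based counting, using a small made-up DataFrame and a made-up list rather than the 311 data.

```python
import pandas as pd

# A small made-up DataFrame with 3 rows.
toy = pd.DataFrame({'ward': [1, 2, 6], 'requests': [10, 25, 7]})

shape = toy.shape
print(type(shape).__name__)   # shape is an ordinary Python tuple
print(shape[0])               # element 0 is the number of rows

# The same counting applies to lists.
channels = ['phone', 'text', 'app', 'web']
print(channels[0])            # the first element
print(channels[3])            # the fourth (and last) element
```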

## Exercise 5
Now try getting the number of columns yourself.

The next thing we may want to do is actually LOOK at the data. But you probably don't want to print out all 300,000 rows of data in your notebook! (In reality, Pandas won't do that. Instead, it will show you a subset of the rows.)

But to have greater control we can use the head command.

In [None]:
df.head()

## Exercise 6

The default head command shows us 5 rows. Try increasing the number of rows it shows us.

This is sort of an annoying way to look at the data, in my opinion. I'm usually interested in looking at the list of columns and the kinds of values they have, so I usually transpose the data when I print it as a head.

In [None]:
df.head().T

Now you can get a sense of what your data looks like. What do you think some of these columns mean?

## Exercise 7
Think about some of the column names and values you have. What do you think these columns are?

We probably want even more information about our variables or columns, so let's learn more about them.

First, we can get a quick list of the column names.

In [None]:
df.columns

The columns attribute gives us the column names, held in a Pandas Index object that behaves much like a list.

We can also use it to find out how many columns we have by taking its length.

In [None]:
len(df.columns)

Note that this returns the same value as df.shape[1].
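
A short sketch of that equivalence on a made-up DataFrame, which also shows how to turn the column names into an ordinary Python list when you need one:

```python
import pandas as pd

# A small made-up DataFrame with three columns.
toy = pd.DataFrame({'ward': [1, 2], 'requests': [10, 25], 'status': ['Open', 'Closed']})

cols = toy.columns              # a Pandas Index, not a plain list
col_list = list(toy.columns)    # convert when an ordinary list is needed
print(col_list)
print(len(toy.columns) == toy.shape[1])
```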

We can also get column data types.

In [None]:
df.dtypes

## Exercise 8

We see three data types here. What does each mean? Why aren't there any "date" types?

We'd like to know even more about our data.

## Exercise 9
Use the describe function to learn more about your data. I like to transpose this, too. Try transposing it.

Do you notice any columns missing from the output? Why do you think they're missing?

You may notice some values are 'NaN'. What does that mean? How might we handle these values?

The describe function provides us a ton of great information about numeric variables, like integers and floats. But categorical variables, which Pandas labels with the "object" data type when they hold text, do not have means, mins, maxes, or standard deviations. So how might we analyze these?
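
Before moving on, here is a small sketch of that behavior on made-up data: one numeric column and one text column, each with a missing value. Counting missing values with `isna().sum()` is one option among several and is shown here as an illustration, not as the lesson's prescribed method.

```python
import numpy as np
import pandas as pd

# Made-up data: a numeric column and a text column, each with one missing value.
toy = pd.DataFrame({
    'ward': [1.0, 2.0, np.nan, 6.0],
    'status': ['Open', 'Closed', 'Closed', None],
})

summary = toy.describe()
print(list(summary.columns))           # only the numeric column is summarized
print(summary.loc['count', 'ward'])    # the count leaves out the missing value
print(toy.isna().sum())                # number of missing values in each column
```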

One of the first steps is to take a look at the unique values of these columns. Let's start with the most interesting one: Service Code Description.

In [None]:
df['SERVICECODEDESCRIPTION'].unique()

We just did two new things right here: we called a column by using its column name, and we called its unique values.

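There's another way to call columns from a Pandas DataFrame if you're feeling lazy and don't want to mess around with brackets. It only works when the column name is a valid Python name (no spaces, for example) and doesn't clash with an existing DataFrame method or attribute.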

In [None]:
df.SERVICECODEDESCRIPTION.unique()
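
The two ways of calling a column can be compared on a small made-up DataFrame. The column names and values below are invented for illustration.

```python
import pandas as pd

# Made-up data: one column name is a valid Python name, the other contains a space.
toy = pd.DataFrame({
    'SERVICECODE': ['A1', 'B2', 'A1'],
    'service type': ['Pothole', 'Bulk Collection', 'Pothole'],
})

# Brackets and dot notation return the same column here.
print(toy['SERVICECODE'].equals(toy.SERVICECODE))

# A name with a space can only be reached with brackets.
print(toy['service type'].unique())
```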

## Exercise 10

We can figure out how many unique values we have by getting the length of this object. Try it.

We can also get value counts for each unique value of a categorical variable using value_counts.

In [None]:
df.SERVICECODEDESCRIPTION.value_counts()
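
As a sketch of what value_counts returns, the cell below runs it on a short made-up Series. The result is sorted from most to least frequent, and passing `normalize=True` gives proportions instead of counts.

```python
import pandas as pd

# A short made-up list of request types.
requests = pd.Series(['Pothole', 'Parking Meter Repair', 'Pothole', 'Bulk Collection', 'Pothole'])

counts = requests.value_counts()                 # sorted from most to least frequent
shares = requests.value_counts(normalize=True)   # proportions instead of counts
print(counts)
print(counts.index[0])                           # label of the most frequent value
print(shares.iloc[0])                            # its share of all rows
```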

## Exercise 11

What's the most common request? What does that request mean? Check out the District's online 311 portal at 311.dc.gov to learn more about the top service request. 