## Introduction to Python: Numpy and Pandas

In this notebook we study two of the main Python packages, Numpy and Pandas. Numpy is a package for scientific computing that enables faster computations than pure Python. Pandas is a powerful package for data analysis and manipulation that loads data into tables and performs fast computations over them.

## Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy.

### Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.
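As a minimal sketch of these terms (the array name `grid` and its values are only illustrative), the attributes `ndim`, `shape`, and `dtype` report the rank, the size along each dimension, and the element type shared by all values:

```python
import numpy as np

# A 2 x 3 array built from a nested list (illustrative values).
grid = np.array([[1, 2, 3], [4, 5, 6]])

rank = grid.ndim      # number of dimensions
shape = grid.shape    # size along each dimension
kind = grid.dtype     # single element type shared by all values

print(rank, shape, kind)
```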

We can initialize numpy arrays from nested Python lists, and access elements using square brackets:

In [None]:
import numpy as np

a = np.array([1, 2, 3])  # Create a rank 1 array
type(a)

In [None]:
a.shape

In [None]:
print(a[0], a[1], a[2]) 

In [None]:
a[0] = 5 

In [None]:
a

Create a 2-dimensional array:

In [None]:
b = np.array([[1,2,3],[4,5,6]])  

In [None]:
b.shape

In [None]:
b[0, 0], b[0, 1], b[1, 0]

Numpy also provides many functions to create arrays:


In [None]:
a = np.zeros((2,2))  # Create an array of all zeros
a

In [None]:
b = np.ones((1,2))   # Create an array of all ones
b

In [None]:
c = np.full((2,2), 7) # Create a constant array
c

In [None]:
d = np.eye(2)        # Create a 2x2 identity matrix
d             

In [None]:
e = np.random.random((2,2)) # Create an array filled with random values
e                    

### Array indexing

Numpy offers several ways to index into arrays.

Slicing: Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array:



In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

In [None]:
a[0, 1]

In [None]:
a[0, 0] = 77
a[0, 0] 

Print only one row

In [None]:
a[1, :]

Print only one column

In [None]:
a[:, 1]
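The cells above pick single elements, rows, or columns. As a further sketch, a slice in each dimension selects a rectangular sub-block. The names `sub` and `independent` are illustrative; the example also shows that a basic slice is a view onto the original data rather than a copy:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Rows 0-1 and columns 1-2: one slice per dimension.
sub = a[:2, 1:3]
print(sub)

# A basic slice is a view, so writing to it changes the original array.
sub[0, 0] = 77
print(a[0, 1])

# Use .copy() when an independent array is needed.
independent = a[:2, 1:3].copy()
independent[0, 0] = -1
print(a[0, 1])
```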

**Boolean array indexing**: Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

In [None]:
a = np.array([[1,2], [3, 4], [5, 6]])

In [None]:
bool_idx = (a > 2)

In [None]:
bool_idx

In [None]:
a[bool_idx]

In [None]:
a[a > 2] 
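A Boolean mask can do more than select values. The following sketch (variable names are illustrative) counts the matching elements and assigns through the mask:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
mask = a > 2

# True counts as 1, so summing the mask counts the matching elements.
n_matching = int(mask.sum())

# Assigning through the mask changes only the selected elements.
clipped = a.copy()
clipped[mask] = 2

print(n_matching)
print(clipped)
```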

### Exercises

Initialize a numpy array from a given list

In [None]:
a = [1, 2, 3, 4, -8, -10]

Print the shape of the numpy array

Initialize a (2,4) numpy array filled with random values

Print only the first row of the created numpy array

Print the last column

Check if array values are bigger than 0.5

Choose the elements of the array that are greater than 0.5

## Introduction to Pandas

In this tutorial we download real data from NYC Open Data through its API and analyse the data in Pandas. We cover the basic Pandas functions, visualize the data, and complete a small assignment in pairs.

Before we start, import the necessary packages:

In [None]:
import pandas as pd

Pandas introduces two new data structures to Python - Series and DataFrame, both of which are built on top of NumPy (this means it's fast).

#### DataFrame
A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can also think of a DataFrame as a group of Series objects, one per named column, that share an index.
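As a small sketch, a DataFrame can be built from a dictionary of columns. The records below are made up for illustration and only borrow column names that resemble the complaints data:

```python
import pandas as pd

# Hypothetical records; the values are invented for this example.
df = pd.DataFrame({
    "Complaint Type": ["Illegal Parking", "Noise", "Heating"],
    "Borough": ["BROOKLYN", "QUEENS", "BRONX"],
    "created_hour": [9, 23, 7],
})

print(df.shape)           # (rows, columns)
print(list(df.columns))   # column names

# Selecting one column returns a Series that shares the DataFrame's row index.
hours = df["created_hour"]
print(type(hours).__name__)
```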

For the rest of the tutorial, we'll be primarily working with DataFrames.

#### Series
A Series is a one-dimensional object similar to an array, list, or column in a table. It will assign a labeled index to each item in the Series. By default, each item will receive an index label from 0 to N, where N is the length of the Series minus one.
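A short sketch of the default and custom index labels (the values are illustrative):

```python
import pandas as pd

# Without an explicit index, labels run from 0 to len(s) - 1.
s = pd.Series(["Noise", "Heating", "Illegal Parking"])
print(list(s.index))

# A custom index replaces the default labels.
counts = pd.Series([120, 45], index=["Open", "Closed"])
print(counts["Closed"])
```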

Using pandas, we will explore the NYC Open Data dataset about Service Requests. This dataset is quite big, since information about complaints has been stored since 2010. The data can be downloaded from [NYC open data](https://data.cityofnewyork.us/Social-Services/311/wpe2-h2i5). For our analysis we will consider a small, already prepared subset of this data starting from 2015.

Begin by reading the data:

In [None]:
complaints = pd.read_csv('nyc_complaints_data_inclass.csv')

Reading a CSV is as simple as calling the read_csv function. By default, the read_csv function expects the column separator to be a comma, but you can change that using the sep parameter.
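To illustrate the `sep` parameter without needing a file on disk, this sketch reads semicolon-separated text from an in-memory buffer. The rows are made up, and `io.StringIO` only stands in for a real file path:

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-separated file (made-up rows).
raw = io.StringIO(
    "Complaint Type;Status\n"
    "Illegal Parking;Closed\n"
    "Noise;Open\n"
)

df = pd.read_csv(raw, sep=";")
print(df)
```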

In [None]:
complaints.shape

We can print only the first 5 rows of the DataFrame using the head() method:

In [None]:
complaints.head()

Or the last 5 rows:

In [None]:
complaints.tail()

Print all DataFrame columns

In [None]:
complaints.columns

Print the column types

In [None]:
complaints.dtypes

Print summary statistics for the DataFrame

In [None]:
complaints.describe()

We can print a single Series (column) of the DataFrame

In [None]:
complaints['Complaint Type']

A second way to print a single column is attribute access, which works when the column name is a valid Python identifier

In [None]:
complaints.Status

### Basics of indexing in Pandas

The first use of indexing is to use a slice, just like we have done with other Python objects. Below we slice the first 5 index values of the first dimension of the dataframe.

In [None]:
complaints[:5]

**created_date** - Date SR was created

**closed_date** - Date SR was closed by responding agency

**agency_name** - Name of the agency responsible for the SR submission

**resolution_action_updated_date** - Date when responding agency last updated the SR

**complaint_type** - Complaint Type may have a corresponding Descriptor (below) or may stand alone.

**status** - Status of SR submitted 

**latitude** - Geo based Lat of the incident location

**longitude** - Geo based Lon of the incident location

**borough** - NYC borough of the incident

**street_name** - Street name of incident address provided by the submitter

The first indexing method is equivalent to using the iloc indexing method, which uses the integer based indexing, purely based on the location of the index.

In [None]:
complaints.iloc[:5]

A second way to index is using loc, which uses the labels of the index. Note that this approach includes the second value in the index range, whereas iloc does not.

In [None]:
complaints.loc[:5]

Note that indexing works for both rows and columns

In [None]:
complaints.loc[:5, :'Agency Name']

In [None]:
complaints.iloc[:5, :5]
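With the default index, positions and labels look the same, which hides the difference between `iloc` and `loc`. The sketch below uses a toy frame whose index labels (10, 20, 30, 40) are chosen purely for illustration so that the two differ:

```python
import pandas as pd

# Toy frame whose index labels differ from the row positions.
df = pd.DataFrame(
    {
        "Status": ["Open", "Closed", "Closed", "Open"],
        "Borough": ["BRONX", "QUEENS", "BROOKLYN", "BRONX"],
    },
    index=[10, 20, 30, 40],
)

by_position = df.iloc[:2]   # positions 0 and 1; the end position is excluded
by_label = df.loc[:30]      # labels 10, 20 and 30; the end label is included

print(len(by_position), len(by_label))

# Both accept a second selector for columns.
print(df.loc[:20, :"Status"])
print(df.iloc[:2, :1])
```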

We can select rows based on their values as well. Notice that we nest a Boolean condition on a column inside the brackets, as in df[df[column] < value], to get this result.

In [None]:
complaints[complaints['created_hour'] < 12]

In [None]:
complaints[complaints['Agency Name'] == 'Department of Finance']

We can sort the DataFrame by a chosen column

In [None]:
complaints.sort_values(by='created_hour', ascending=True)
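Sorting by several columns follows the same pattern. In this sketch on made-up data, `by` and `ascending` are lists of equal length, one entry per sort key:

```python
import pandas as pd

df = pd.DataFrame({
    "Status": ["Open", "Closed", "Open", "Closed"],
    "created_hour": [9, 14, 7, 3],
})

# Status A-Z, then the latest hour first within each status.
ordered = df.sort_values(by=["Status", "created_hour"], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; df itself keeps its original order.
print(df)
```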

Here we show how to set a value of a cell in the table, identifying a specific row by index label, and setting its closed date, in this case to a None value, which Pandas interprets as a NaN (missing value).

In [None]:
complaints.loc[688,'Closed Date'] = None

We can filter for values that are Null

In [None]:
complaints[complaints['Closed Date'].isnull()]

Or more commonly, filter out the null values.

In [None]:
complaints[complaints['Closed Date'].notnull()]
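The same steps can be tried on a toy frame. The dates and the row label 1 are illustrative, not taken from the complaints data:

```python
import pandas as pd

df = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Illegal Parking"],
    "Closed Date": ["2015-01-02", "2015-01-05", "2015-01-09"],
})

# Mark the second record (index label 1) as not closed.
df.loc[1, "Closed Date"] = None

missing = df["Closed Date"].isnull()
print(missing.tolist())

# ~ negates the mask, giving the same rows as notnull().
print(df[~missing])
```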

Here we find and print records that are related to parking, using the str accessor and its contains method to search for the substring 'Parking' in the complaint type.

In [None]:
complaints[complaints['Complaint Type'].str.contains('Parking')]

We can combine two conditions in order to select rows

In [None]:
complaints[(complaints['Complaint Type'].str.contains('Parking')) & (complaints['Closed Date'].notnull())]
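A sketch of combining conditions on made-up data. Naming each mask first (`is_parking` and `is_closed` are illustrative names) keeps the final selection readable, and `na=False` is one way to handle rows where the complaint type is missing:

```python
import pandas as pd

df = pd.DataFrame({
    "Complaint Type": ["Illegal Parking", "Noise", "Blocked Parking Space", None],
    "Closed Date": ["2015-01-02", "2015-01-03", None, "2015-01-04"],
})

# na=False treats a missing complaint type as "no match", so the mask stays Boolean.
is_parking = df["Complaint Type"].str.contains("Parking", na=False)
is_closed = df["Closed Date"].notnull()

# When conditions are written inline, each needs its own parentheses,
# because & and | bind more tightly than comparisons such as == or <.
both = df[is_parking & is_closed]      # parking complaints that are closed
either = df[is_parking | is_closed]    # parking complaints or closed ones

print(len(both), len(either))
```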

We can find the unique values of a column

In [None]:
complaints['Status'].unique()

In [None]:
complaints['Status'].unique().shape

The value_counts method will tally up the number of times a value appears in a column, and will return a Series with the counts, in descending order.

In [None]:
complaints['Status'].value_counts()
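On a small made-up Series the ordering is easy to check by eye. The `normalize=True` option, shown here as an optional extra, reports shares instead of raw counts:

```python
import pandas as pd

status = pd.Series(["Closed", "Open", "Closed", "Pending", "Closed", "Open"])

counts = status.value_counts()
print(counts)

# normalize=True reports shares instead of raw counts.
shares = status.value_counts(normalize=True)
print(shares)
```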

In [None]:
complaints['Complaint Type'][:10]

## Pandas visualization

We can also do some plotting of the data without much effort:

Import the matplotlib package

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

As we have geo data of complaints represented as a pair of latitude and longitude, we can easily plot the complaints location.

In [None]:
complaints.plot(kind='scatter', x='Longitude', y='Latitude')
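A standalone sketch of the same plot on made-up coordinates. The `Agg` backend is an assumption made so the example runs outside a notebook; inside a notebook the `%matplotlib inline` line above plays that role. The `alpha` and `s` arguments are optional styling choices:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import pandas as pd

# Made-up coordinates roughly within New York City.
points = pd.DataFrame({
    "Longitude": [-73.99, -73.95, -73.87, -74.01],
    "Latitude": [40.73, 40.65, 40.84, 40.71],
})

# alpha makes overlapping points easier to see; s sets the marker size.
ax = points.plot(kind="scatter", x="Longitude", y="Latitude", alpha=0.5, s=10)
ax.set_title("Complaint locations (toy data)")
```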

## Exercises

Read the file nyc_complaints_data_exercises.csv into a DataFrame and answer the following questions

Print the shape of the DataFrame

Print the first 5 rows of the DataFrame

How can we find the earliest created date in DataFrame?

How can we find the latest created date in DataFrame?

How can we print just the column containing street names?

How can we get a list of boroughs in the DataFrame, without duplicates?

What is the largest NYC Agency responsible for complaints?

How can we compute number of complaints per hour?

How can we compute a Boolean array indicating whether the Agency is 'New York City Police Department'?

How can we create a new DataFrame containing only the 'New York City Police Department' records? Print the last 5 rows of this new DataFrame

How can we use row and column indexing to set the status to closed in the second row of DataFrame?

How can we print the DataFrame, sorted by Status and by Complaint type?

How can we find the boroughs where the most complaints are about illegal parking at 9am?

This solution uses an & operator to set two conditions that must both be met: