# Python Foundations of Data Science

In this class, we will learn to work with data in Python, starting with how to import and manipulate files. Press `Shift + Enter` to run each cell.

## Pandas

Pandas is the workhorse of data science in Python. It reads most common file formats and provides the tools for working with the data once it is loaded. You need to import it before use; by convention it is aliased as `pd`.

In [None]:
import pandas as pd

Having imported Pandas, we can load our files. We will start with a CSV, which is a comma-separated values text file. CSV is a frequently used format for data interchange because it is plain text: nobody has to worry about which database the others are using or what format their data is stored in. The function that reads CSV files is `pd.read_csv()`.

In [None]:
file_df = pd.read_csv('./assets/Boston/train.csv')

Pandas reads the file and returns a DataFrame that we can manipulate. We can look at the first few rows with the `.head()` method.

In [None]:
file_df.head()

`.head()` takes an optional parameter `n`, the number of rows to show (5 by default). The file has an ID column, which we might want to use as the index of the records we will be working with. Let's reload our file with that index and show 7 records.

In [None]:
file_df = pd.read_csv('./assets/Boston/train.csv', index_col='ID')
file_df.head(n=7)

We might want to get a list of what column names and types we have loaded into our DataFrame. We can do that with the `.info()` method.

In [None]:
file_df.info()

The output lists each column's type and how many of its values are not null. This matters because many calculations and models do not handle null values well, so we need to know where they are.
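
As a small illustration of checking for nulls directly, the sketch below counts the missing values per column. The DataFrame `toy_df` and its values are made up for this example; they are not taken from the Boston file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan marks a missing value.
toy_df = pd.DataFrame({'rooms': [6.5, np.nan, 7.1],
                       'price': [24.0, 21.6, np.nan]})

# isna() flags each missing cell; sum() then counts the flags per column.
null_counts = toy_df.isna().sum()
print(null_counts)

# The non-null count reported by .info() is the row count minus these numbers.
non_null_counts = len(toy_df) - null_counts
print(non_null_counts)
```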

When dealing with numeric values, it is good to get some summary statistics of the columns as a sanity check. We can do this with `.describe()`, which reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles of each numeric column.

In [None]:
file_df.describe()

Not every file is comma separated. Some files are separated by tabs, `|`, or some other delimiter. `.read_csv()` lets you specify the separator with the `sep` argument. You can also specify whether the file has column headers in the first row, whether there is an index column, how to handle null values, and a lot more. The documentation is available here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
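
The sketch below shows a few of these options together. It reads a pipe-delimited text that has no header row; the text and the column names are invented for the example, and an in-memory `io.StringIO` buffer stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data with no header row.
raw_text = "1|0.02|24.0\n2|0.03|21.6\n"

piped_df = pd.read_csv(
    io.StringIO(raw_text),         # a file path would also work here
    sep='|',                       # the delimiter used in the data
    header=None,                   # the first row is data, not column names
    names=['ID', 'crim', 'medv'],  # supply the column names ourselves
    index_col='ID',                # use the ID column as the index
)
print(piped_df)
```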

You can read in JSON files using `.read_json()` and Excel files with `.read_excel()`.

### Saving Files
After working on a DataFrame, you may need to save it for later use. You can use `.to_csv()` to save it in CSV format. It supports a few options, so you will want to read the documentation.

In [None]:
file_df.to_csv('./output.csv')
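
One option worth knowing is `index`. By default the index is written as the first column; passing `index=False` leaves it out. The sketch below uses a made-up DataFrame and calls `.to_csv()` without a path, which returns the CSV text as a string instead of writing a file.

```python
import io
import pandas as pd

# Hypothetical toy data for the example.
toy_df = pd.DataFrame({'col_a': [1, 2], 'col_b': ['x', 'y']})

# With no path, to_csv() returns the CSV text rather than writing to disk.
with_index = toy_df.to_csv()
without_index = toy_df.to_csv(index=False)
print(with_index)
print(without_index)

# Reading the index-free text back gives the same columns and values.
restored_df = pd.read_csv(io.StringIO(without_index))
print(restored_df)
```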

### Creating DataFrames

You can create a DataFrame from a dictionary. Each key becomes a column name and each list becomes that column's values.

In [None]:
my_dict = {'col_a': [1, 2, 3], 'col_b': ['a', 'b', 'c']}
my_df = pd.DataFrame(my_dict)
my_df

You can also create a DataFrame from a list.

In [None]:
my_list = ['item 1', 'item 2', 'item 3']
my_df = pd.DataFrame(my_list)
my_df
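
When a DataFrame is built from a list, the column is labelled `0` unless you name it. The sketch below shows the `columns` argument, and also a list of lists, which is read as one row per inner list. The names and ages are invented for the example.

```python
import pandas as pd

my_list = ['item 1', 'item 2', 'item 3']

# Without column names, the single column is labelled 0.
unnamed_df = pd.DataFrame(my_list)
print(unnamed_df.columns.tolist())

# The columns argument assigns a readable name instead.
named_df = pd.DataFrame(my_list, columns=['item'])
print(named_df)

# A list of lists is read as one row per inner list (hypothetical data).
rows = [['Atkins', 31], ['Bruce', 28]]
people_df = pd.DataFrame(rows, columns=['name', 'age'])
print(people_df)
```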

### Slicing DataFrames
You can take subsets from a DataFrame. Let's revisit our imported file.

In [None]:
file_df.head()

We can create a new DataFrame by passing a list of the column names we need inside the indexing brackets, which is why there are two pairs of square brackets.

In [None]:
new_df = file_df[['lstat', 'crim', 'medv']]
new_df.head()

We can extract a single column as a Pandas Series. While a DataFrame is a two-dimensional object, a `Series` is a one-dimensional object.

In [None]:
medv = file_df['medv']
medv

We can check the type of the Series.

In [None]:
print(type(medv))
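
The number of brackets decides what comes back. As a rough sketch on a made-up DataFrame, a single column name returns a one-dimensional Series, while a list containing that same name returns a two-dimensional DataFrame with one column.

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the imported file.
toy_df = pd.DataFrame({'medv': [24.0, 21.6, 34.7],
                       'crim': [0.01, 0.03, 0.03]})

as_series = toy_df['medv']    # single brackets: a Series
as_frame = toy_df[['medv']]   # double brackets: a one-column DataFrame

print(type(as_series), as_series.shape)
print(type(as_frame), as_frame.shape)
```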

We can get a NumPy array out of a Series object using `.values`.

In [None]:
print(type(medv.values))

In [None]:
medv.values

You can extract and view the index of the DataFrame.

In [None]:
my_idx = file_df.index
my_idx

You can sort the columns by name (`axis=1`) in either ascending or descending order.

In [None]:
file_df.sort_index(axis=1, ascending=False)

You can sort the rows by the contents of a column.

In [None]:
file_df.sort_values(by='medv')
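
Two details are easy to miss: sorting returns a new DataFrame and leaves the original untouched, and `by` also accepts a list of columns to break ties. The sketch below shows both on made-up data.

```python
import pandas as pd

# Hypothetical toy data reusing two column names from the imported file.
toy_df = pd.DataFrame({'medv': [24.0, 21.6, 34.7],
                       'crim': [0.03, 0.01, 0.03]})

# Largest medv first; the result is a new DataFrame.
by_medv_desc = toy_df.sort_values(by='medv', ascending=False)
print(by_medv_desc)

# Sort by crim, then use medv to order rows with equal crim.
by_two = toy_df.sort_values(by=['crim', 'medv'])
print(by_two)

# The original row order is unchanged.
print(toy_df)
```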

## Types

Data types are implicitly determined: Python infers the type from the value you assign.

In [None]:
a = 5.0
b = 2
c = 'Hello'

You can use `type()` to find the type of an object.

In [None]:
print(type(a))

In [None]:
print(type(b))

In [None]:
print(type(c))

## Arithmetic

You can carry out arithmetic operations on numeric objects. Mixing an integer with a float gives a float.

In [None]:
print(a + b)

In [None]:
c = a + b
print(c ** 2)

## Strings

In [None]:
a = 'My'
b = 'Name'

You can concatenate strings using addition. No space is inserted between them.

In [None]:
c = a + b
print(c)

You can use string formatting to insert values into a string.

In [None]:
d = 'Amos'
print('His name is {}'.format(d))
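
In Python 3.6 and later, f-strings are a common alternative to `.format()`. The sketch below is only an illustration; the `age` value is made up.

```python
d = 'Amos'
age = 30  # hypothetical value for the example

# The variable goes directly inside the braces of an f-string.
greeting = f'His name is {d}'
print(greeting)

# Several values can be mixed into one string.
summary = f'{d} is {age} years old'
print(summary)

# A format specification after the colon controls the display.
ratio = f'{2 / 3:.2f}'
print(ratio)  # 0.67
```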

## Python Lists

Lists are useful for storing objects of the same type (and meaning), for example, the names of friends. Elements are zero-indexed.

In [None]:
friends = ['Atkins', 'Bruce', 'Chang']
print('The first name on my list is {}'.format(friends[0]))

Lists can contain any object type, as well as other lists:

In [None]:
friends = [ ['Atkins', 'atkins@friends.me'], ['Bruce', 'bruce@others.com']]
print(friends)

You can index lists from the front starting from `0` or from the back starting with `-1`.

In [None]:
temps = [22.0, 23.5, 23.0, 22.5, 22.0, 21.0, 22.5]
print('The first day had a temperature of {}'.format(temps[0]))
print('The second to last day had a temperature of {}'.format(temps[-2]))

You can extract a subset by slicing. You specify the start index and the end index separated by a `:`. The result includes the start index **but** excludes the end index.

In [None]:
print('The temperatures of the second and third days are: {}'.format(temps[1:3]))
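
Because indexing starts at `0`, the second and third days sit at indices 1 and 2, so the slice that selects them is `[1:3]`. The sketch below repeats the same list and shows a few more slice forms.

```python
temps = [22.0, 23.5, 23.0, 22.5, 22.0, 21.0, 22.5]

# Indices 1 and 2: the start is included, the end (3) is excluded.
second_and_third = temps[1:3]
print(second_and_third)  # [23.5, 23.0]

# Leaving out the start means "from the beginning".
first_three = temps[:3]
print(first_three)       # [22.0, 23.5, 23.0]

# Negative indices count from the end; leaving out the end means "to the end".
last_two = temps[-2:]
print(last_two)          # [21.0, 22.5]

# A third number is the step: every other element.
every_other = temps[::2]
print(every_other)       # [22.0, 23.0, 22.0, 22.5]
```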

You can edit a list element directly.

In [None]:
print('Original Temps: {}'.format(temps))
temps[0] = 21.0
print('New Temps: {}'.format(temps))

You can concatenate lists with `+`.

In [None]:
more_temps = [22.5, 24.0, 24.5]
print(temps + more_temps)

You can copy lists using `list()` or slicing.

In [None]:
t_1 = temps[:]
t_2 = list(temps)
print(t_1)
print(t_2)
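
Copying matters because plain assignment does not create a new list; it only gives the same list a second name. The sketch below contrasts the two. It also shows that these copies are shallow: inner lists are still shared.

```python
temps = [22.0, 23.5, 23.0]

alias = temps      # a second name for the same list
copied = temps[:]  # a new list with the same elements

temps[0] = 99.0
print(alias)   # the change shows up through the alias
print(copied)  # the copy keeps the original value

# The copy is shallow: nested lists are shared, not duplicated.
nested = [[1, 2], [3]]
shallow = list(nested)
nested[0].append(9)
print(shallow)
```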

## NumPy
Numeric Python

Recall that Python lists are:
* Powerful
* Collections of values
* Able to hold values of different types
* Mutable: you can change, add, and remove elements

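However, lists are a poor fit for numerical work in data science, for two reasons:
* Operations over large lists are slow, because each element has to be processed one at a time in Python
* Lists do not support element-wise mathematical operations. For example, `+` concatenates two lists instead of adding their elements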
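
The second point can be seen with two short lists. In the sketch below, `+` and `*` do not do arithmetic on the elements, a direct BMI calculation raises a `TypeError`, and the pure-Python workaround needs an explicit loop.

```python
height = [1.73, 1.68, 1.71]
weight = [65.4, 59.2, 63.6]

# + concatenates and * repeats; neither touches the numbers themselves.
print(height + height)
print(height * 2)

# Element-wise arithmetic is not defined for lists.
list_error = None
try:
    weight / height ** 2
except TypeError as err:
    list_error = err
    print(f'TypeError: {err}')

# The workaround is to loop over the pairs explicitly.
bmi = [w / h ** 2 for w, h in zip(weight, height)]
print(bmi)
```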

**NumPy**

The alternative to Python lists.
* Supports calculations over entire arrays
* Fast

In [None]:
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

In [None]:
import numpy as np

np_height = np.array(height)
np_weight = np.array(weight)

bmi = np_weight / np_height ** 2
print(bmi)

NumPy operations are carried out element-wise. The arrays need to have the same shape, or shapes that NumPy can broadcast to a common shape (for example, an array and a single number).
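
As a rough sketch of what element-wise means when the shapes differ, the example below combines an array with a single number and a two-dimensional array with a one-dimensional one. NumPy calls this broadcasting. The values are made up, and shapes that cannot be matched raise a `ValueError`.

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71])

# A single number is applied to every element.
height_cm = np_height * 100
print(height_cm)

# A 1-D array with 3 elements is applied to each row of a 2x3 array.
grid = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])
offsets = np.array([10.0, 20.0, 30.0])
shifted = grid + offsets
print(shifted)

# Lengths 3 and 2 cannot be matched up, so NumPy refuses.
shape_error = None
try:
    np.array([1, 2, 3]) + np.array([1, 2])
except ValueError as err:
    shape_error = err
    print(f'ValueError: {err}')
```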

**NumPy arrays must contain only one type**
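
If you pass values of mixed types, NumPy does not raise an error; it converts them to one common type. The sketch below inspects the resulting `dtype` for two made-up inputs.

```python
import numpy as np

# Integers mixed with a float are all stored as floats.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers, mixed_numbers.dtype)

# Mixing in a string turns every element into a string.
mixed_text = np.array([1.0, 'is', True])
print(mixed_text, mixed_text.dtype)
```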

NumPy arrays support subsetting with the same index and slice syntax as lists.

In [None]:
print(bmi[1])

In [None]:
print(bmi[:2])

They support comparisons, which return an array of boolean values.

In [None]:
print(bmi > 21)

They support boolean indexing: a boolean array selects the elements where it is `True`.

In [None]:
print(bmi[bmi > 21])

NumPy arrays can have more than one dimension. Note the number of brackets.

In [None]:
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])
print(np_2d)

In [None]:
print(np_2d[0])

In [None]:
print(np_2d[0][2])

In [None]:
print(np_2d[0,2])

In [None]:
print(np_2d[:,2])

**NumPy supports statistical operations**

In [None]:
print(np.mean(np_2d[0]))

In [None]:
print(np.std(np_2d[0]))

In [None]:
print(np.median(np_2d[0]))

In [None]:
print(np.corrcoef(np_2d[0], np_2d[1]))

In [None]:
print(np.sum(np_2d[0]))

**There is even a random number generator**

In [None]:
height = np.round(np.random.normal(1.75, 0.20, 5000), 2) # normal() takes mean, standard deviation, and quantity
weight = np.round(np.random.normal(60.32, 15, 5000), 2)

np_city = np.column_stack((height, weight))
np_city
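
Random draws change on every run, which makes results hard to reproduce. One option, sketched below, is NumPy's `default_rng()` generator with a fixed seed. The seed value 42 is arbitrary, and this generator is an alternative to the `np.random.normal()` call used above, not a replacement required by this class.

```python
import numpy as np

# A generator created with a fixed seed produces the same numbers each run.
rng = np.random.default_rng(seed=42)

# normal() takes mean, standard deviation, and quantity, as before.
height = np.round(rng.normal(1.75, 0.20, 5000), 2)
weight = np.round(rng.normal(60.32, 15, 5000), 2)

np_city = np.column_stack((height, weight))
print(np_city.shape)

# A second generator with the same seed repeats the same draws.
rng_again = np.random.default_rng(seed=42)
height_again = np.round(rng_again.normal(1.75, 0.20, 5000), 2)
print(np.array_equal(height, height_again))
```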

## Basic Plotting

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.192, 4.263, 6.972]

# line plot
_ = plt.plot(year, pop)
plt.show()

In [None]:
# scatter plot
_ = plt.scatter(year, pop)
plt.show()

In [None]:
# bar plot
_ = plt.bar(year, pop)
plt.show()

In [None]:
# histogram
values = [0,0.6,1.4,1.6,2.2,2.5,2.6,3.2,3.5,3.9,4.2,6]
_ = plt.hist(values)
plt.show()

In [None]:
# bin the histogram
_ = plt.hist(values, bins=5)
plt.show()

In [None]:
# label axes
_ = plt.plot(year, pop)
_ = plt.xlabel('year')
_ = plt.title('World Population')
_ = plt.ylabel('pop')
plt.show()

## Data Wrangling

Data wrangling is the process of transforming data from one form into another. Concretely, this might involve:
* Locating missing data
* Changing data from one type to another
* Merging two or more DataFrames
* Deleting or dropping some data
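
The four tasks above can be sketched on two small DataFrames. Everything here is hypothetical: the frames `homes` and `towns`, their columns, and their values are invented for the example, and the calls shown are one way to do each task, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing price, and room counts stored as text.
homes = pd.DataFrame({'ID': [1, 2, 3],
                      'rooms': ['6', '7', '5'],
                      'price': [24.0, np.nan, 34.7]})
towns = pd.DataFrame({'ID': [1, 2, 3],
                      'town': ['A', 'B', 'A']})

# Locating missing data: keep the rows whose price is null.
missing_rows = homes[homes['price'].isna()]
print(missing_rows)

# Changing data from one type to another: text to integers.
homes['rooms'] = homes['rooms'].astype(int)

# Merging two DataFrames on their shared ID column.
merged = homes.merge(towns, on='ID')
print(merged)

# Dropping data: remove rows with no price, then remove a column.
trimmed = merged.dropna(subset=['price']).drop(columns=['town'])
print(trimmed)
```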