# 2. Working with Data in Python

[Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

In [None]:
import pandas as pd

In [None]:
# loading data
df_iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

In [None]:
# inspecting data
df_iris.head(10)

In [None]:
df_iris['sepal_length']

Pandas tracks the data type of each column in a DataFrame. It also recognizes null and NaN ("Not-a-Number") values, which often mark missing or otherwise invalid entries. Here we imported pandas as pd and read the iris data into the DataFrame df_iris; calling info() lists each column's dtype and its count of non-null values.
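
As a minimal sketch of how pandas reports column types and missing values, the cell below builds a small made-up DataFrame and counts the missing entries per column. The name `df_toy` and its values are illustrative assumptions, not part of the iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one value is missing in each column
df_toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7],
    "species": ["setosa", "setosa", None],
})

print(df_toy.dtypes)           # one dtype per column
missing = df_toy.isna().sum()  # number of missing entries per column
print(missing)
```

`isna()` treats both `NaN` and `None` as missing, so each column reports one missing entry here.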

In [None]:
df_iris.info()

In [None]:
# NumPy and pandas working together
# df_iris.values

## 2.1 What is NumPy?

[numpy cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)

NumPy is a Python library for working with arrays.

Python lists can serve as arrays, but they are slow to process element by element.

NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

The array object in NumPy is called ndarray; NumPy provides many supporting functions that make working with ndarray easy.

Arrays are used very frequently in data science, where speed and resources are very important.
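
To make the comparison concrete, here is a small sketch that squares the same numbers once with a list comprehension and once with a single vectorized NumPy expression. The variable names are illustrative, and no timing is claimed: the actual speed-up depends on the machine and the array size (in a notebook, `%timeit` can measure it).

```python
import numpy as np

values = list(range(1_000))

# Plain Python: visit the list element by element
squares_list = [v * v for v in values]

# NumPy: one vectorized expression over the whole array
arr_values = np.array(values)
squares_arr = arr_values ** 2

print(squares_arr[:5])
print(squares_list == squares_arr.tolist())  # both approaches give the same numbers
```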



In [None]:
import numpy as np

In [None]:
arr = np.array([1, 2, 3, 4])
# np.array(l)

In [None]:
print(type(arr))

### 2.1.1 Dimensions in Arrays

A dimension of an array is one level of nesting depth: a 1-D array is a flat sequence, a 2-D array is an array of 1-D arrays, and so on.



In [None]:
# 1-D array
arr = np.array([1, 2, 3, 4])
print(arr.size, arr.ndim)

In [None]:
# 2-D array
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.size, arr.ndim)

In [None]:
print(arr)

In [None]:
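# a notebook cell only displays its last expression, so print both results
print(arr[0])     # first row
print(arr[0, 1])  # element in row 0, column 1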

In [None]:
# 3-D array
arr = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

print(arr, '\n')
print(arr.size, arr.ndim)

In [None]:
# higher dimension arrays
arr = np.array([1, 2, 3, 4], ndmin=5)
print(arr)

In this array the innermost dimension (the 5th) has 4 elements. The 4th dimension has 1 element, the vector; the 3rd has 1 element, the matrix containing that vector; the 2nd has 1 element, a 3-D array; and the 1st has 1 element, a 4-D array.
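
One way to check this description is to inspect `ndim` and `shape` directly. The short sketch below uses a separate name, `arr5`, chosen here only for illustration.

```python
import numpy as np

arr5 = np.array([1, 2, 3, 4], ndmin=5)

print(arr5.ndim)             # 5
print(arr5.shape)            # (1, 1, 1, 1, 4)
print(arr5[0, 0, 0, 0, 2])   # 3: index 0 in each outer dimension, then position 2
```

The four leading dimensions each have length 1, and only the last one holds the four values.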



In [None]:
# accessing elements + slicing 1-D
arr = np.array([1, 2, 3, 4, 5, 6, 7])
arr[4]

In [None]:
arr[4:]

In [None]:
arr[2:5]

In [None]:
arr[2:5:]

In [None]:
# accessing elements + slicing 2-D
arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

arr[1, 1:4]

In [None]:
# data type in arrays
np.array(['dog', 'cat'])
# np.array(['dog', 'cat', 2])

In [None]:
# np.array(['dog', 'cat', [6, 7, 8, 9, 10]])

In [None]:
# type conversion
arr = np.array([1, 0, 3])
newarr = arr.astype(bool)

print(newarr)
print(newarr.dtype)

## Back to Dataframes

In [None]:
df_iris.columns

## 2.2 Building dataframes 

In [None]:
states = ['NY']
cities = ['New York']

# Construct a dictionary: data
data = {'state':states, 'city':cities}

In [None]:
data

In [None]:
df = pd.DataFrame(data)

In [None]:
df

In [None]:
# exporting 
df.to_csv('./class_03.csv')

## 2.3 Plotting with Pandas

In [None]:
df_iris.plot()

In [None]:
# %pip install matplotlib

In [None]:
df_iris['sepal_length'].plot()

# 3. Visualization with Matplotlib

From A. Mueller's [notebook](https://github.com/amueller/COMS4995-s19/blob/master/slides/aml-03-matplotlib/aml-03-012517.ipynb)

## 3.1 Matplotlib

[cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)

In [None]:
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
ax11 = plt.subplot(2, 2, 1)
ax21 = plt.subplot(2, 2, 2)
ax12 = plt.subplot(2, 2, 3)
ax22 = plt.subplot(2, 2, 4)

ax11.set_title("ax11")
ax21.set_title("ax21")
ax12.set_title("ax12")
ax22.set_title("ax22")
plt.show()

Three integers (nrows, ncols, index). The subplot will take the index position on a grid with nrows rows and ncols columns. index starts at 1 in the upper left corner and increases to the right. index can also be a two-tuple specifying the (first, last) indices (1-based, and including last) of the subplot, e.g., fig.add_subplot(3, 1, (1, 2)) makes a subplot that spans the upper 2/3 of the figure.
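
The two-tuple form can be sketched as follows. The names `ax_top` and `ax_bottom` are illustrative, and the layout is just the 3x1 example from the paragraph above with a third axes added to fill the remaining cell.

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# On a 3-row, 1-column grid, index (1, 2) spans cells 1 through 2
ax_top = fig.add_subplot(3, 1, (1, 2))
# Cell 3 is the remaining lower third
ax_bottom = fig.add_subplot(3, 1, 3)

ax_top.set_title("index=(1, 2)")
ax_bottom.set_title("index=3")
fig.tight_layout()
```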

In [None]:
plt.figure()
ax11 = plt.subplot(2, 2, 1)
ax21 = plt.subplot(2, 2, 2)
ax2 = plt.subplot(2, 1, 2)
ax11.set_title("ax11")
ax21.set_title("ax21")
ax2.set_title("ax2")

In [None]:
import numpy as np
sin = np.sin(np.linspace(-4, 4, 100))
fig, axes = plt.subplots(2, 2)
# plt.plot draws on the current axes, which here is the last one created (bottom right)
plt.plot(sin)

In [None]:
fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(sin)

In [None]:
df_iris['species'].unique()

In [None]:
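import os

# take heights and labels from the same Series so they stay aligned
counts = df_iris['species'].value_counts()

plt.figure()
plt.bar(range(len(counts)), counts.values)
plt.xticks(range(len(counts)), counts.index, rotation=0)

plt.tight_layout()
os.makedirs("images", exist_ok=True)  # savefig fails if the target directory is missing
plt.savefig("images/matplotlib_bar", bbox_inches="tight", dpi=300)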

In [None]:
df_iris['species'].value_counts()

In [None]:
# df_iris['species'].   # incomplete expression, left commented out so the cell runs

[Matplotlib Cheat Sheet](https://www.datacamp.com/community/blog/python-matplotlib-cheat-sheet)

## 3.2 Seaborn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

In [None]:
# %pip install seaborn

In [None]:
tips = sns.load_dataset("tips")
sns.relplot(x="total_bill", y="tip", data=tips)

In [None]:
df = pd.DataFrame(dict(time=np.arange(500),
                       value=np.random.randn(500).cumsum()))
g = sns.relplot(x="time", y="value", kind="line", data=df)
g.fig.autofmt_xdate()

More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean:

In [None]:
fmri = sns.load_dataset("fmri")
sns.relplot(x="timepoint", y="signal", kind="line", data=fmri);
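
The aggregation described above can be approximated by hand. This is only a sketch of the idea, assuming a percentile bootstrap for the interval; seaborn's internal procedure may differ in detail. The data, the helper `bootstrap_ci`, and all names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 20 repeated measurements at each of 5 x values
x_values = np.repeat(np.arange(5), 20)
toy = pd.DataFrame({
    "x": x_values,
    "y": rng.normal(loc=x_values, scale=1.0),
})

def bootstrap_ci(values, n_boot=1000, level=95):
    """Percentile bootstrap interval for the mean of a 1-D sample."""
    values = np.asarray(values)
    # Resample with replacement n_boot times and take the mean of each resample
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - level) / 2
    return np.percentile(boot_means, [half, 100 - half])

rows = []
for x_value, group in toy.groupby("x"):
    low, high = bootstrap_ci(group["y"].to_numpy())
    rows.append({"x": x_value, "mean": group["y"].mean(),
                 "ci_low": low, "ci_high": high})

summary = pd.DataFrame(rows).set_index("x")
print(summary)
```

Each row of `summary` corresponds to one point on the aggregated line and the band drawn around it.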

The lineplot() function has the same flexibility as scatterplot(): it can show up to three additional variables by modifying the hue, size, and style of the plot elements. It does so using the same API as scatterplot(), meaning that we don’t need to stop and think about the parameters that control the look of lines vs. points in matplotlib.

In [None]:
sns.relplot(x="timepoint", y="signal", hue="event", kind="line", data=fmri);


Line plots are often used to visualize data associated with real dates and times. These functions pass the data down in their original format to the underlying matplotlib functions, and so they can take advantage of matplotlib’s ability to format dates in tick labels. But all of that formatting will have to take place at the matplotlib layer, and you should refer to the matplotlib documentation to see how it works:



In [None]:
df = pd.DataFrame(dict(time=pd.date_range("2017-1-1", periods=500),
                       value=np.random.randn(500).cumsum()))
g = sns.relplot(x="time", y="value", kind="line", data=df)
g.fig.autofmt_xdate()
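
As a sketch of what formatting "at the matplotlib layer" can look like, the cell below plots a similar random walk with plain matplotlib and sets the tick locator and label format through `matplotlib.dates`. The three-month spacing and the `"%b %Y"` format are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

dates = pd.date_range("2017-01-01", periods=500)
walk = np.random.default_rng(0).standard_normal(500).cumsum()

fig, ax = plt.subplots()
ax.plot(dates, walk)

# One major tick every three months, labelled like "Jan 2017"
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))
fig.autofmt_xdate()  # rotate the labels so they do not overlap
```

The same locator and formatter calls can be applied to the axes of a seaborn figure, since seaborn draws onto ordinary matplotlib axes.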

### Plotting univariate distributions

When dealing with a set of data, often the first thing you’ll want to do is get a sense for how the variables are distributed.

The most convenient way to take a quick look at a univariate distribution in seaborn used to be the distplot() function, which by default draws a histogram and fits a kernel density estimate (KDE). distplot() has been deprecated since seaborn 0.11; histplot() with kde=True draws a comparable figure.

In [None]:
x = np.random.normal(size=100)
sns.histplot(x, kde=True);
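
To show what a kernel density estimate computes, here is a NumPy-only sketch of a Gaussian KDE. It is not seaborn's implementation: the bandwidth is hand-picked (an assumption; libraries normally choose it from the data), and the names `sample`, `grid`, and `kde` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=100)

# Hand-picked bandwidth (assumption); it controls how smooth the curve is
bandwidth = 0.4
grid = np.linspace(-5, 5, 501)

# Place one Gaussian bump on every observation and average them
z = (grid[:, None] - sample[None, :]) / bandwidth
kde = np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 over the grid
dx = grid[1] - grid[0]
area = (kde * dx).sum()
print(round(area, 3))
```

A smaller bandwidth follows individual observations more closely, while a larger one gives a smoother curve.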

### Plotting bivariate distributions

It can also be useful to visualize a bivariate distribution of two variables. The easiest way to do this in seaborn is to just use the jointplot() function, which creates a multi-panel figure that shows both the bivariate (or joint) relationship between two variables along with the univariate (or marginal) distribution of each on separate axes.

In [None]:
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])


In [None]:
sns.jointplot(x="x", y="y", data=df);