# Python Lab Exercise #2

## Objectives:

- Load .csv files into `pandas` DataFrames
- Describe and manipulate data in Series and DataFrames
- Visualize data using DataFrame methods and `matplotlib`

![pandas](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2880px-Pandas_logo.svg.png)

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sqlite3

## What is Pandas?

Pandas, as [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, offers us "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but it's quite a bit more powerful.

Let's read in the heart dataset.

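Pandas has many functions for reading different types of sources, such as `pd.read_csv()` for .csv files and `pd.read_sql_query()` for SQL databases. Note that the cell below reads the data from a SQLite database rather than directly from a .csv file.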

Read about this dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci).

In [134]:
# Load the data from the SQLite database python.db.
# This assumes python.db already contains a table named 'heart'; if it does
# not, pandas raises "DatabaseError: ... no such table: heart".
import pandas as pd
import sqlite3

con = sqlite3.connect('python.db')
heart_df = pd.read_sql_query('SELECT * FROM heart', con)

# Save a copy of the data back to the database as the table 'heart_data'.
heart_df.to_sql('heart_data', con, if_exists='replace', index=False)

The output of the `pd.read_sql_query()` function is a pandas *DataFrame*, which has a familiar tabular structure of rows and columns.

In [6]:
type(heart_df)
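To see the database round trip in isolation, here is a minimal sketch that uses an in-memory SQLite database instead of `python.db`. The table name `toy_heart` and the values in `toy_df` are made up for illustration and are not part of the heart dataset.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for the heart dataset.
toy_df = pd.DataFrame({'age': [63, 37, 41], 'chol': [233, 250, 204]})

# An in-memory database keeps the example self-contained (no python.db needed).
con = sqlite3.connect(':memory:')

# Write the DataFrame to a table; index=False leaves out the row labels.
toy_df.to_sql('toy_heart', con, if_exists='replace', index=False)

# Read the table back; read_sql_query returns a DataFrame.
round_trip = pd.read_sql_query('SELECT * FROM toy_heart', con)
con.close()

print(type(round_trip))
print(round_trip)
```

Writing with `.to_sql()` and reading with `pd.read_sql_query()` are separate steps: only the read step hands back a DataFrame.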


## DataFrames and Series

Two main types of pandas objects are the DataFrame and the Series, the latter being in effect a single column of the former:

In [35]:
age_series = heart_df['age']
type(age_series)


Notice how we can isolate a column of our DataFrame simply by using square brackets together with the name of the column.

Both Series and DataFrames have an *index* as well:

In [33]:
heart_df.index


In [None]:
age_series.index
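The relationship between a DataFrame, its columns, and the shared index can be checked on a small example. This is a sketch with a made-up `toy_df`; only the column names are borrowed from the heart dataset.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only.
toy_df = pd.DataFrame({'age': [63, 37, 41], 'sex': [1, 1, 0]})

age_col = toy_df['age']        # single brackets give a Series
age_frame = toy_df[['age']]    # a list of names gives a one-column DataFrame

print(type(age_col).__name__)
print(type(age_frame).__name__)

# The Series carries the same row index as the DataFrame it came from.
print(age_col.index.equals(toy_df.index))
```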

Pandas is built on top of NumPy, and we can always access the NumPy array underlying a DataFrame using `.values`.

In [None]:
heart_df.values
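As a small sketch of what `.values` returns, the made-up `toy_df` below mixes an integer column with a float column. The pandas documentation recommends `.to_numpy()` as the newer spelling; both are shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one integer column and one float column.
toy_df = pd.DataFrame({'age': [63, 37], 'oldpeak': [2.3, 3.5]})

arr = toy_df.values          # the underlying data as a NumPy array
arr_new = toy_df.to_numpy()  # equivalent, and the recommended form

print(type(arr))
print(arr.shape)
# Columns of different numeric types are upcast to one common dtype.
print(arr.dtype)
```

Note that the column names and the index are not part of the array, so they are lost in the conversion.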

## Basic DataFrame Attributes and Methods

### `.head()`

In [14]:
# complete the python code here
heart_df.head()

### `.tail()`

In [None]:
# complete the python code here


### `.info()`

In [None]:
# complete the python code here


### `.describe()`

In [None]:
# complete the python code here


### `.dtypes`

In [None]:
# complete the python code here


### `.shape`

In [None]:
# complete the python code here


### Exploratory Plots

Let's make ourselves a histogram of ages:

In [None]:
sns.set_style('darkgrid')
sns.histplot(data=heart_df, x='age', kde=True);

# For older versions of seaborn (before 0.11), which lack histplot:
# sns.distplot(a=heart_df['age']);

And while we're at it let's do a scatter plot of maximum heart rate vs. age:

In [None]:
sns.scatterplot(x=heart_df['age'], y=heart_df['thalach']);

## Adding to a DataFrame

### Adding Rows

Here are two rows that our engineer accidentally left out of the .csv file, expressed as a Python dictionary:

In [None]:
extra_rows = {'age': [40, 30], 
              'sex': [1, 0], 
              'cp': [0, 0], 
              'trestbps': [120, 130],
              'chol': [240, 200],
              'fbs': [0, 0], 
              'restecg': [1, 0], 
              'thalach': [120, 122], 
              'exang': [0, 1],
              'oldpeak': [0.1, 1.0], 
              'slope': [1, 1], 
              'ca': [0, 1], 
              'thal': [2, 3],
              'target': [0, 0]}
extra_rows

How can we add this to the bottom of our dataset?

In [None]:
# Let's first turn this into a DataFrame.
# We can pass the dictionary straight to the pd.DataFrame() constructor
# (pd.DataFrame.from_dict() would also work).

missing = pd.DataFrame(extra_rows)
missing
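Before concatenating the real data, it may help to see what `ignore_index` does on a tiny example. This is a sketch: `base` holds made-up values, while `new_rows` reuses the `age` and `chol` values from `extra_rows`.

```python
import pandas as pd

# Hypothetical existing data (values are illustrative only).
base = pd.DataFrame({'age': [63, 37], 'chol': [233, 250]})
# Two rows to append, built from a dictionary of lists.
new_rows = pd.DataFrame({'age': [40, 30], 'chol': [240, 200]})

# Default behaviour: each frame keeps its own row labels, so labels repeat.
stacked_keep = pd.concat([base, new_rows])
# ignore_index=True discards the old labels and renumbers from 0.
stacked_reset = pd.concat([base, new_rows], ignore_index=True)

print(list(stacked_keep.index))    # [0, 1, 0, 1]
print(list(stacked_reset.index))   # [0, 1, 2, 3]
```

Repeated labels make label-based selection ambiguous, which is why renumbering is usually the safer choice when stacking rows.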

In [None]:
# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True.
# Store the result in `heart_augmented`, which the later cells use.
# complete the python code here



In [None]:
# Let's check the end to make sure we were successful!
# complete the python code here



### Adding Columns

Adding a column is very easy in `pandas`. Let's add a new column to our dataset called "test", and set all of its values to 0.

In [None]:
heart_augmented['test'] = 0

In [None]:
heart_augmented.head()

I can also add columns whose values are functions of existing columns.

Suppose I want a new column that holds the sum of the cholesterol column ("chol") and the resting systolic blood pressure column ("trestbps"):

In [None]:
# complete the python code here



In [None]:
heart_augmented.head()
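Both ways of adding a column can be tried on a small frame. The sketch below uses a made-up `toy_df`, and the column names `flag` and `chol_centered` are hypothetical; it computes a different derived column than the exercise above.

```python
import pandas as pd

# Hypothetical toy data (values are illustrative only).
toy_df = pd.DataFrame({'age': [63, 37, 41], 'chol': [200, 240, 250]})

# Assigning a scalar broadcasts it to every row.
toy_df['flag'] = 0

# Arithmetic on a column is applied element by element,
# here subtracting the column mean from each value.
toy_df['chol_centered'] = toy_df['chol'] - toy_df['chol'].mean()

print(toy_df)
```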

## Filtering

We can use filtering techniques to see only certain rows of our data. To see only the rows for patients 70 years of age or older, we first build a Boolean mask (a Series of True/False values) and then use that mask to index the DataFrame:

In [None]:
heart_augmented['age'] >= 70

In [None]:
heart_augmented[heart_augmented['age'] >= 70]

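Use `&` for "and" and `|` for "or" when combining conditions, and wrap each condition in parentheses, since these operators bind more tightly than comparisons such as `>=`.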
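Here is a minimal sketch of combining Boolean masks, using a made-up `toy_df` and a heart-rate threshold chosen only for illustration. The `~` operator, which negates a mask, is included as well.

```python
import pandas as pd

# Hypothetical toy data (values and thresholds are illustrative only).
toy_df = pd.DataFrame({'age': [71, 45, 68, 77],
                       'thalach': [125, 170, 150, 162]})

older = toy_df['age'] >= 70      # True, False, False, True
fast = toy_df['thalach'] > 160   # False, True, False, True

both = toy_df[older & fast]      # rows where both conditions hold
# Written inline, each condition needs its own parentheses.
either = toy_df[(toy_df['age'] >= 70) | (toy_df['thalach'] > 160)]
neither = toy_df[~(older | fast)]  # ~ negates a mask

print(list(both.index))      # [3]
print(list(either.index))    # [0, 1, 3]
print(list(neither.index))   # [2]
```

Python's `and` and `or` keywords do not work element by element on a Series, which is why the symbols are used instead.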

### Exercise

Display the patients who are 70 or over as well as the patients whose trestbps score is greater than 170.

In [None]:
# complete the python code here



### Exploratory Plot

Using the subframe we just made, let's make a scatter plot of their cholesterol levels vs. age and color by sex:

In [None]:
# complete the python code here
at_risk = 

sns.scatterplot(data=at_risk, x='age', y='chol', hue='sex');

### `.loc` and `.iloc`

We can use `.loc` to get, say, the first ten values of the age and resting blood pressure ("trestbps") columns:

In [None]:
heart_augmented.loc

In [None]:
heart_augmented.loc[:9, ['age', 'trestbps']]

`.iloc` is used for selecting locations in the DataFrame **by number**:

In [None]:
heart_augmented.iloc

In [None]:
heart_augmented.iloc[3, 0]

In [None]:
heart_augmented.head()
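The difference between the two indexers is easiest to see when the row labels are not the default integers. The sketch below uses a made-up `toy_df` with letters as row labels, which is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical toy data with string row labels (illustrative only).
toy_df = pd.DataFrame({'age': [63, 37, 41, 56],
                       'trestbps': [145, 130, 130, 120],
                       'chol': [233, 250, 204, 236]},
                      index=['a', 'b', 'c', 'd'])

# .loc selects by label, and a label slice includes its end point ('c').
by_label = toy_df.loc['a':'c', ['age', 'chol']]

# .iloc selects by integer position, and the slice end (3) is excluded.
by_position = toy_df.iloc[0:3, [0, 2]]

print(by_label.equals(by_position))   # True
print(toy_df.iloc[3, 0])              # 56: fourth row, first column
```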

### Exercise

How would we get the same slice as in the `.loc` example above (`heart_augmented.loc[:9, ['age', 'trestbps']]`) by using `.iloc` instead of `.loc`?

In [None]:
# complete the python code here



## Statistics

### `.mean()`

In [None]:
# complete the python code here



Be careful! Some of these are not straightforwardly interpretable. What does an average "sex" of 0.682 mean?
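One way to read such a number, assuming the column is coded with only 0s and 1s: the mean equals the fraction of rows coded as 1. The made-up Series below sketches this; its values are not taken from the heart dataset.

```python
import pandas as pd

# Hypothetical 0/1 coded column (values are illustrative only).
sex = pd.Series([1, 1, 0, 1, 0], name='sex')

mean_value = sex.mean()
share_of_ones = (sex == 1).sum() / len(sex)

print(mean_value)      # 0.6
print(share_of_ones)   # 0.6
```

The mean is a proportion here, not a typical value, and the same reasoning would not carry over to a categorical column with more than two codes.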

### `.min()`

In [None]:
# complete the python code here



### `.max()`

In [None]:
# complete the python code here



## Series Methods

### `.value_counts()`

How many different values does slope have? What about sex? And target?

In [None]:
heart_augmented['slope'].value_counts()

In [None]:
heart_augmented['sex'].value_counts()

### `.sort_values()`

In [None]:
heart_augmented['age'].sort_values()
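These two Series methods can be tried on small made-up Series, sketched below. `.nunique()` is added as a direct way to count distinct values, and `ascending=False` reverses the sort order.

```python
import pandas as pd

# Hypothetical values (illustrative only).
slope = pd.Series([2, 1, 1, 0, 2, 1], name='slope')
ages = pd.Series([63, 37, 41], name='age')

# value_counts lists each distinct value with its frequency, most common first.
counts = slope.value_counts()
print(counts)

# nunique gives just the number of distinct values.
print(slope.nunique())   # 3

# sort_values sorts ascending by default and keeps the original row labels.
oldest_first = ages.sort_values(ascending=False)
print(oldest_first.tolist())        # [63, 41, 37]
print(list(oldest_first.index))     # [0, 2, 1]
```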

## `pandas`-Native Plotting

The `.plot()` and `.hist()` methods available for DataFrames are wrappers around `matplotlib`:

In [None]:
heart_augmented.plot(x='age', y='trestbps', kind='scatter');

In [None]:
heart_augmented.hist(column='chol');

### Exercises

1. Make a bar plot of "age" vs. "slope" for the `heart_augmented` DataFrame.

In [None]:
# complete the python code here



2. Make a histogram of ages for **just the men** in `heart_augmented` (`heart_augmented['sex'] == 1`).

In [None]:
# complete the python code here



3. Make separate scatter plots of cholesterol vs. resting systolic blood pressure for the target=0 and the target=1 groups. Put both plots on the same figure and give each an appropriate title.

In [None]:
# complete the python code here

